<<

arXiv:1502.06974v1 [astro-ph.IM] 24 Feb 2015 D)gi em(hc a easaa ope an or gain, complex scalar a be can (which 2 direction-independent term single employs gain Tra- a calibration (DI) sky. with 2GC) model the known instrumental or of a an generation, model using (second constructed) gains, ditional iteratively antenna or complex solv (prior, unknown of consists the calibration for gain interferometry, radio In INTRODUCTION cetd21 eray2.Rcie 05Fbur 3 nori in 13; February 2015 Received 24. February 2015 Accepted o.Nt .Ato.Soc. Astron. R. Not. Mon. ..Smirnov O.M. complex a problem as optimization calibration gain interferometric Radio ⋆ th of calibration to specifically ourselves wor restrict this will In overview. we recent 1996); a interferome- al. provides imple- (2011a,b,c) radio et Smirnov Hamaker the and see of (RIME, proposed equation framework measurement Dif- try been the screens). in have pri- mostly this ionospheric mented, (e.g. to offsets, model approaches pointing ferent instrumental terms, beams, parameterized gain mary some DD solvable by independently or can by which represented effects, be (DD) direction-dependent addresses also calibra (3GC) Third-generation interval. time/frequency 2 1 3 × K ot fia r lo,TePr,Pr od iead,7 Pinelands, Road, Park Park, The Floor, 3rd University, Africa, Rhodes South Electronics, SKA and Physics of Department EI bevtied ai,CR,Uiest ai Diderot Paris Université CNRS, Paris, de Observatoire GEPI, -al [email protected] E-mail: 2 complex-valued 12 oe matrix Jones ⋆ .Tasse C. , 000 –8(04 rne 6Fbur 05(NL (MN 2015 February 26 Printed (2014) 1–18 , eia,Tcnqe:interferometric Techniques: merical, words: Key formalism. the from derived gorithm h otx ftegnrlfraim ial,w rsn an present we of Finally, results formalism. general applied the of context the lmnain;w hwta oercnl rpsdcalibra proposed recently comp some in that StefCal results show inversion we describi matrix plementations; for approximate calculus Using matrix furt regime. We operator-based algori regime. an calibration gain to of calculus direction-dependent family and new polarized con whole the with a combined yields when approaches, cal formalism, gain Jacobian interferometric complex radio derivat Jaco general of real Wirtinger problem conventional the the the to velopments called to counterpart formalism fu Jacobian func a complex real-valued complex employs s of This of domain extended able. the optimization have into etc.) least-squares theory Levenberg-Marquardt, for optimization in gorithms developments Recent ABSTRACT e nen,prsome per antenna, per ) 31 n eln a eudrto sseilcsso hs n pl and this, of cases special as understood be can peeling and ntuetto:itreoees ehd:analytical, Methods: interferometers, Instrumentation: CohJones DI e tion ing ia om21 coe 30 October 2014 form ginal OBx9,Gaason 10SuhAfrica South 6140 Grahamstown, 94, Box PO k lc ue ase,910Muo,France Meudon, 92190 Janssen, Jules place 5 , 0 ot Africa South 405 nte pcaie ieto-eedn airto al calibration direction-dependent specialized another , ic ope ieetaincnb endi nyavery a only variables, particular, in real (in defined of sense be functions restricted can differentiation to complex (LM). restricted since such Levenberg-Marquardt been 2004), have and al. These et (GN) Madsen Gauss-Newton see overview, as gradient-based ap- an al- various Traditional (for involve techniques is problems 2013). 
NLLS Yatawatta visibilities to & been proaches Kazemi observed have by treatments on other proposed (though noise Gaussian always the most since problem, solved being of sense the in direction). latter per (the independently terms gains DD and outdfiiino ope rdetoeao.Ti leads This operator. gradient the complex a called of shown definition robust formalism have a 2012) al. using et Laurent that 2009; (Kreutz-Delgado as gains prob- the real of parts a variables. imaginary as real and has problem real independent conundrum the NLLS treating this complex by the lem of recast out to way been traditional comple are the interferometry radio variables: in Gains definition). usual anclbaini o-ierlatsurs(NLLS) squares least non-linear a is calibration Gain eetdvlpet notmzto theory optimization in developments Recent Wrigr12)alw o mathematically a for allows 1927) (Wirtinger A T E tl l v2.2) file style X gteplrzdcalibration polarized the ng bain n hwhwthe how show and ibration, ∂ mlmnainadsome and implementation hs nldn hs for those including thms, e xedteWirtinger the extend her in eapyteede- these apply We bian. z/∂z cin (Gauss-Newton, nctions ¯ in facmlxvari- complex a of tions ttoal ffiin im- efficient utationally inagrtm uhas such algorithms tion v,addrvsafull- a derives and ive, etoa optimization ventional m rdtoa al- traditional ome osnteiti the in exist not does itne complex Wirtinger ehd:nu- Methods: c hmin them ace x - 2 O.M. Smirnov & C. Tasse to the construction of a complex Jacobian J, which in Table 1. Notation and frequently used symbols turn allows for traditional NLLS algorithms to be directly applied to the complex case. We summarize these x scalar value x developments and introduce basic notation in Sect. 1. x¯ complex conjugate In Sect. 2, we follow on from Tasse (2014) to apply this x vector x theory to the RIME, and derive complex Jacobians for X matrix X X X (unpolarized) DI and DD gain calibration. vector of 2 × 2 matrices =[Xi] (Sect. 4) R space of real numbers C space of complex numbers In principle, the use of Wirtinger calculus and complex I identity matrix Jacobians ultimately results in the same system of LS equa- diag x diagonal matrix formed from x tions as the real/imaginary approach. It does offer two im- ||· ||F Frobenius norm portant advantages: (i) equations with complex variables are (·)T transpose more compact, and are more natural to derive and analyze (·)H Hermitian transpose than their real/imaginary counterparts, and (ii) the struc- ⊗ outer product a.k.a. Kronecker product ¯ ture of the complex Jacobian can yield new and valuable x¯, X element-by-element complex conjugate of x, X x X insights into the problem. This is graphically illustrated in x˘, X˘ augmented vectors x˘ = , X˘ = i x¯ XH Fig. 1 (in fact, this figure may be considered the central    i  X upper half of matrix X insight of this paper). Methods such as GN and LM hinge U X , X left, right half of matrix X around a large matrix – J H J – with dimensions correspond- L R XUL upper left quadrant of matrix X ing to the number of free parameters; construction and/or Y Y Y order of operations is XU = XU = (XU) , inversion of this matrix is often the dominant algorithmic Y Y or X U = (X )U cost. If J H J can be treated as (perhaps approximately) d, v, r, g, m data, model, residuals, gains, sky coherency sparse, these costs can be reduced, often drastically. 
Figure 1 (·)k value associated with iteration k H shows the structure of an example J J matrix for a DD gain (·)p,k value associated with antenna p, iteration k calibration problem. The left column row shows versions of (·)(d) value associated with direction d J H J constructed via the real/imaginary approach, for four W matrix of weights different orderings of the solvable parameters. None of the Jk, J k∗ partial, conjugate partial Jacobian at iteration k J full complex Jacobian orderings yield a matrix that is particularly sparse or easily H, H˜ J H J and its approximation invertible. The right column shows a complex J H J for the vec X vectorization operator same orderings. Panel (f) reveals sparsity that is not appar- RA right-multiply by A operator (Sect. 4) ent in the real/imaginary approach. This sparsity forms the LA left-multiply by A operator (Sect. 4) basis of a new fast DD calibration algorithm discussed later i δj Kronecker delta symbol in the paper. A B matrix blocks "C D#

In Sect. 3, we show that different algorithms may be de- ց, ր, ↓ repeated matrix block rived by combining different sparse approximations to J H J with conventional GN and LM methods. In particular, we show that StefCal, a fast DI calibration algorithm recently 1 WIRTINGER CALCULUS & COMPLEX proposed by Salvini & Wijnholds (2014a), can be straight- LEAST-SQUARES forwardly derived from a diagonal approximation to a com- The traditional approach to optimizing a of n com- plex J H J. We show that the complex Jacobian approach plex variables f(z), z ∈ Cn is to treat the real and imaginary naturally extends to the DD case, and that other sparse parts z = x + iy independently, turning f into a function approximations yield a whole family of DD calibration algo- of 2n real variables f(x, y), and the problem into an opti- rithms with different scaling properties. One such algorithm, mization over R2n. CohJones (Tasse 2014), has been implemented and success- Kreutz-Delgado (2009) and Laurent et al. (2012) pro- fully applied to simulated LOFAR data: this is discussed in pose an alternative approach to the problem based on Sect. 6. Wirtinger (1927) calculus. The central idea of Wirtinger cal- culus is to treat z and z¯ as independent variables, and op- In Sect. 4 we extend this approach to the fully polar- timize f(z, z¯) using the Wirtinger ized case, by developing a Wirtinger-like operator calculus ∂ 1 ∂ ∂ ∂ 1 ∂ ∂ in which the polarization problem can be formulated suc- = − i , = + i , (1.1) ∂z 2 ∂x ∂y ∂z¯ 2 ∂x ∂y cinctly. This naturally yields fully polarized counterparts to     the calibration algorithms defined previously. In Sect. 5, we where z = x + iy. It is easy to see that discuss other algorithmic variations, and make connections ∂z¯ ∂z to older DD calibration techniques such as peeling (Noordam = = 0, (1.2) 2004). ∂z ∂z¯ i.e. that z¯ (z) is treated as constant when taking the deriva- tive with respect to z (z¯). From this it is straightforward to While the scope of this work is restricted to LS solutions to the DI and DD gain calibration problem, the potential ap- define the complex gradient operator plicability of complex optimization to radio interferometry ∂ ∂ ∂ ∂ ∂ ∂ ∂ = , = ,..., , ,..., , (1.3) is perhaps broader. We will return to this in the conclusions. ∂C z ∂z ∂z¯ ∂z ∂z ∂z¯ ∂z¯    1 n 1 n  Radio interferometric gain calibration as a complex optimization problem 3 from which definitions of the complex Jacobian and complex Laurent et al. (2012) show that Eqs. 1.8 and 1.9 yield Hessians naturally follow. The authors then show that vari- exactly the same system of LS equations as would have been ous optimization techniques developed for real functions can produced had we treated r(z) as a function of real and imag- be reformulated using complex Jacobians and Hessians, and inary parts r(x, y), and taken ordinary derivatives in R2n. applied to the complex optimization problem. In particu- However, the complex Jacobian may be easier and more el- lar, they generalize the Gauss-Newton (GN) and Levenberg- egant to derive analytically, as we’ll see below in the case of Marquardt (LM) methods for solving the non-linear least radio interferometric calibration. squares (NLLS) problem1

min ||r(z, z¯)||F , or min ||d − v(z, z¯)||F (1.4) z z 2 SCALAR (UNPOLARIZED) CALIBRATION Cm where r, d, v have values in , and || · ||F is the Frobenius In this section we will apply the formalism above to the norm. The latter form refers to LS fitting of the parame- scalar case, i.e. that of unpolarized calibration. This will terized model v to observed data d, and is the preferred then be extended to the fully polarized case in Sect. 4. formulation in the context of radio interferometry. Complex NLLS is implemented as follows. Let us for- mally treat z and z¯ as independent variables, define an aug- 2.1 Direction-independent calibration mented parameter vector containing both, Let us first explore the simplest case of direction- z independent (DI) calibration. Consider an interferometer ar- z˘ = (1.5) z¯ ray of Nant antennas measuring Nbl = Nant(Nant −1)/2 pair-   wise visibilities. Each antenna pair pq (1 6 p < q 6 N ) ˘ ant and designate its value at step k by zk. Then, define measures the visibility3

∂v ∂v r(z˘k) J k = (z˘k), J k∗ = (z˘k), r˘k = (1.6) gpmpqg¯q + npq, (2.1) ∂z ∂z¯ r¯(z˘k)   where mpq is the (assumed known) sky coherency, gp is the We’ll call the m × n matrices J k and J k∗ the partial and (unknown) complex gain parameter associated with antenna 2 partial conjugate Jacobian respectively, and the 2m-vector p, and npq is a complex noise term that is Gaussian with a ˘ rk the augmented residual vector. The complex Jacobian of mean of 0 in the real and imaginary parts. The calibration the model v can then be written (in block matrix form) as problem then consists of estimating the complex antenna g J J ∗ gains by minimizing residuals in the LS sense: J = k k , (1.7) J¯ ∗ J¯ 2 k k min |rpq| , rpq = dpq − gpmpqg¯q, (2.2)   g pq with the bottom two blocks being element-by-element con- X jugated versions of the top two. Note the use of J¯ to indi- where dpq are the observed visibilities. Treating this as a cate element-by-element conjugation – not to be confused complex optimization problem as per the above, let us write with the Hermitian conjugate which we’ll invoke later. J is out the complex Jacobian. With a vector of Nant complex a 2m × 2n matrix. The GN update step is defined as parameters g and Nbl measurements dpq, we’ll have a full complex Jacobian of shape 2N × 2N . It is conventional δz bl ant δz˘ = =(J H J)−1J H r˘ , (1.8) to think of visibilities laid out in a visibility matrix; the nor- δz¯ k   mal approach at this stage is to vectorize dpq by fixing a The LM approach is similar, but introduces a damping numbering convention so as to enumerate all the possible parameter λ: antenna pairs pq (p < q) using numbers from 1 to Nbl. In- stead, let us keep using pq as a single “compound index”, δz δz˘ = =(J H J + λD)−1J H r˘ , (1.9) with the implicit understanding that pq in subscript corre- δz¯ k   sponds to a single index from 1 to Nbl using some fixed where D is the diagonalized version of J H J. With λ = 0 this enumeration convention. Where necessary, we’ll write pq in becomes equivalent to GN, with λ →∞ this corresponds to square brackets (e.g. a[pq],i) to emphasize this. steepest descent (SD) with ever smaller steps. Now consider the corresponding partial Jacobian J k Note that while δz and δz¯ are formally computed inde- matrix (Eq. 1.6). This is of shape Nbl × Nant. Using the pendently, the structure of the equations is symmetric (since Wirtinger derivative, we can write the partial Jacobian in the function being minimized – the Frobenius norm – is real terms of its value at row [pq] and column j as and symmetric w.r.t. z and z¯), which ensures that δz = δz¯. m g¯ , j = p, [J ] = pq q (2.3) In practice this redundancy usually means that only half the k [pq],j 0, otherwise. calculations need to be performed. 

3 In principle, the autocorrelation terms pp, corresponding to the 1 It should be stressed that Wirtinger calculus can be applied to total power in the field, are also measured, and may be incorpo- a broader range of optimization problems than just LS. rated into the equations here. It is, however, common practice to 2 Laurent et al. (2012) define the Jacobian via ∂r rather than omit autocorrelations from the interferometric calibration prob- ∂v. This yields a Jacobian of the opposite sign, and introduces a lem due to their much higher noise, as well as technical difficulties minus sign into Eqs. 1.8 and 1.9. In this paper we use the ∂v con- in modeling the total intensity contribution. The derivations be- vention, as is more common in the context of radio interferometric low are equally valid for p 6 q; we use p

In other words, within each column j, J k is only non- since the value at row i, column j of each block is zero at rows corresponding to baselines [jq]. We can express 2 |yiq |, i=j this more compactly using the Kronecker delta: i j Aij = y¯pqypq δpδp = q6=i pq ( P 0, i6=j X j=1...Nant i j y¯ij y¯ji, i6=j j Bij = y¯pqy¯qpδpδq = [pq]=1...N (pp): P y r δi P y r¯  qp pq q   iq iq i=1...Nant pq q6=i P   P  r d − g m g¯ r }[pq]=1...N (p

j=1...Nant j=1...Nant data and model visibilities as 2Nbl-vectors, using the com- pound index [pq] (p 6= q): j j J = m g¯ δ g m δ , r˘ = r }[pq]=1...2N pq q p p pq q pq bl d˘= [d ], v˘ = [g m g¯ ], r˘ = d˘− v˘ (2.15) z }| { z }| { (2.8) pq p pq q     where for q >p, rpq =r ¯qp and mpq =m ¯ qp. For clar- As noted by Tasse (2014), we have the wonderful prop- ity, we may adopt the following order for enumerating the erty that row index [pq]: 12, 13,..., 1n, 21, 22,..., 2n, 31, 32,..., 3n, 1 g ...,n1,...,nn − 1. v˘ = J g = Jg˘, where g˘ = , (2.16) L 2 g¯ Equation A2 in the Appendix provides an example of J   for the 3-antenna case. For brevity, let us define the short- (where XL designates the left half of matrix X – see Table 1 hand for a summary of notation), which basically comes about due to the RIME being bilinear with respect to g and g¯. Substituting this into the GN update step, and noting that ypq = mpqg¯q. (2.9) X(Y L)=(XY )L, we have

H δg We can now write out the structure of J J. This is Hermi- =(J H J)−1J H (d˘−J g)=(J H J)−1J H d˘−g. (2.17) δg¯ L tian, consisting of four Nant × Nant blocks:   Consequently, the updated gain values at each iteration can A B A B be derived directly from the data, thus obviating the need J H J = = (2.10) C D BH A for computing residuals. Additionally, since the bottom half     Radio interferometric gain calibration as a complex optimization problem 5 of the equations is simply the conjugate of the top, we only 2.4 Weighting need to evaluate the top half: Although Laurent et al. (2012) do not mention this explic- H −1 H ˘ itly, it is straightforward to incorporate weights into the gk+1 = gk + δg =(J J) UJ d, (2.18) complex LS problem. Equation 1.4 is reformulated as where XU designates the upper half of matrix X. min ||W r˘(z, z¯)||F , (2.24) The derivation above assumes an exact inversion of z H = J H J. In practice, this large matrix can be costly to invert, so the algorithms below will substitute it with some where W is an M × M weights matrix (usually, the inverse cheaper-to-invert approximation H˜ . Using the approximate of the data covariance matrix C). This then propagates into matrix in the GN update equation, we find instead that the LM equations as ˘ H I −1 H ˘ ˜ −1 H ˘ I ˜ −1 δz =(J W J + λ ) J W rk. (2.25) gk+1 = H UJ d +( − H UHL)gk, (2.19) Adding weights to Eqs. 2.22 and 2.23, we arrive at the which means that when an approximate J H J is in use, the following: shortcut of Eq. 2.18 only applies when 2 H ˜ −1 diag wiqs|yiqs| ր H UHL = I. (2.20) q6=i,s J H W J =  P  (2.26) wijsyijsyjis, i6=j We will see examples of both conditions below, so it is  s ց  worth stressing the difference: Eq. 2.18 allows us to compute (P 0 , i=j    updated solutions directly from the data vector, bypassing   the residuals. This is a substantial computational shortcut, however, when an approximate inverse for H is in use, it wiqsy¯iqsriqs J H W r˘ q6=i,s does not necessarily apply (or at least is not exact). Under = . (2.27)  P ↓H  the condition of Eq. 2.20, however, such a shortcut is exact.  

2.5 Direction-dependent calibration 2.3 Time/frequency solution intervals Let us apply the same formalism to the direction-dependent A common use case (especially in low-SNR scenarios) is to (DD) calibration problem. We reformulate the sky model employ larger solution intervals. That is, we measure mul- as a sum of Ndir sky components, each with its own DD tiple visibilities per each baseline pq, across an interval of gain. It has been common practice to do DD gain solu- timeslots and frequency channels, then obtain complex gain tions on larger time/frequency intervals than DI solutions, solutions that are constant across each interval. The mini- both for SNR reasons, and because short intervals lead to mization problem of Eq. 2.1 can then be re-written as under-constrained solutions and suppression of unmodeled

2 sources. We therefore incorporate solution intervals into the min |rpqs| , rpqs = dpqs − gpmpqsg¯q, (2.21) g equations from the beginning. The minimization problem pqs X becomes: where s = 1, ..., Ns is a sample index enumerating all the Ndir samples within the time/frequency solution interval. We can 2 (d) (d) (d) min |rpqs| , rpqs = dpqs − gp mpqsg¯q . (2.28) g repeat the derivations above using [pqs] as a single com- pqs X Xd=1 pound index. Instead of having shape 2Nbl × 2Nant, the Ja- cobian will have a shape of 2NblNs ×2Nant, and the residual It’s obvious that the Jacobian corresponding to this problem H vector will have a length of 2NblNs. In deriving the J J is very similar to the one in Eq. 2.8, but instead of having term, the sums in Eq. 2.10 must be taken over all pqs rather shape 2Nbl × 2Nant, this will have a shape of 2NblNs × than just pq. Defining the usual shorthand of ypqs = mpqsg¯q, 2NantNdir. We now treat [pqs] and [jd] as compound indices: we then have: j=1...Nant j=1...Nant 2 H diag |yiqs| ր d=1...Ndir d=1...Ndir q6=i,s H   J J = P i6=j , (2.22) yijsyjis, (d) (d) j (d) (d) j [pq]=1...2Nbl (p6=q) s ց J = mpqsg¯q δp gp mpqsδq   z }| { z }| { s=1...Ns (P 0, i=j       (2.29)   where the symbols ց and րH represent a copy and a copy- Every antenna j and direction d will correspond to a column transpose of the appropriate matrix block (as per the struc- in J, but the specific order of the columns (corresponding H (d) ture of Eq. 2.10). Likewise, the J r˘ term can be written to the order in which we place the gp elements in the aug- as: mented parameter vector g˘) is completely up to us. Consider now the J H J product. This will consist of 2×2 y¯ r 2 iqs iqs blocks, each of shape [NantNdir] . Let’s use i, c to designate H ˘ q6=i,s J r = . (2.23) the rows within each block, j, d to designate the columns,  P H  ↓ (d) (d) (d) H and define ypqs = mpqsg¯q . The J J matrix will then have   6 O.M. Smirnov & C. Tasse the following block structure: price of using an approximate inverse for J H J is a less accu- rate update step, so we can expect to require more iterations i (c) (d) H δj y¯iqsyiqs ր before convergence is reached. H q6=i,s H A B Combining this approximation with GN optimization J J = =  P(c) (d)  , (2.30) BA yjisyijs , i6=j and using Eq. 2.19, we find the following expression for the   s ց   GN update step: (P 0 i=j      ˜ −1 H ˘ H g = HULJ L d, (3.2) while the J r˘ term will be a vector of length 2NantNdir, k+1 with the bottom half again being a conjugate of the top where XUL designates the top left quadrant of matrix X. half. Within each half, we can write out the element corre- ˜ −1 Note that the condition of Eq. 2.20 is met: H UHL = sponding to antenna j, direction d: A−1A = I, i.e. the GN update can be written in terms d˘ H c (d) of . This comes about due to (i) the off-diagonal blocks J r˘ = , c = y¯ rjqs. (2.31) ˜ c jd jqs of H being null, which masks out the bottom half of HL, ¯ ˜   qX6=j,s and (ii) the on-diagonal blocks of H being an exact inverse. Finally, let us note that the property of Eq. 2.16 also In other algorithms suggested below, the second condition holds for the DD case. It is easy to see that particularly is not the case. The per-element expression, in the diagonal approxima- Ndir (d) (d) (d) tion, is v˘ = gp mpqsg¯q = J Lg˘. (2.32) −1  Xd=1  gp,k+1 = y¯pqypq y¯pqdpq. (3.3) q6=p q6=p Consequently, the shortcut of Eq. 2.18 also applies. 
X  X Equation 3.3 is identical to the update step proposed by Hamaker (2000), and later adopted by Mitchell et al. (2008) for MWA calibration, and independently derived by 3 INVERTING J H J AND SEPARABILITY Salvini & Wijnholds (2014a) for the StefCal algorithm. In principle, implementing one of the flavours of calibra- Note that these authors arrive at the result from a differ- tion above is “just” a matter of plugging Eqs. 2.12+2.14, ent direction, by treating Eq. 2.2 as a function of g only, 2.22+2.23, 2.26+2.27 or 2.30+2.31 into one the algorithms and completely ignoring the conjugate term. The resulting defined in Appendix C. Note, however, that both the GN complex Jacobian (Eq. 2.6) then has null off-diagonal blocks, and LM algorithms hinge around inverting a large matrix. and J H J becomes diagonal. This will have a size of 2N or 2N N squared, for the Interestingly, applying the same idea to LM optimiza- ant ant dir ˜ DI or DD case respectively. With a naive implementation tion (Eq. 1.9), and remembering that H is diagonal, we can of matrix inversion, which scales cubically, algorithmic costs derive the following update equation instead: 3 3 3 become dominated by the O(Nant) or O(NantNdir) cost of λ 1 −1 g = g + H˜ J H d˘, (3.4) inversion. k+1 1+ λ k 1+ λ UL L In this section we investigate approaches to simplify- ing the inversion problem by approximating H = J H J by which for λ = 1 essentially becomes the basic StefCal some form of (block-)diagonal matrix H˜ . Such approxima- average-update step of . We should note that tion is equivalent to separating the optimization problem Salvini & Wijnholds (2014a) empirically find better conver- into of parameters that are treated as independent. gence when Eq. 3.2 is employed for odd k, and Eq. 3.4 for even k. In terms of the framework defined here, the basic We will show that some of these approximations are simi- StefCal lar to or even fully equivalent to previously proposed algo- algorithm can be succinctly described as com- J H J rithms, while others produce new algorithmic variations. plex optimization with a diagonally-approximated , us- ing GN for the odd steps, and LM (λ = 1) for the even steps. 3.1 Diagonal approximation and StefCal Establishing this equivalence is very useful for our pur- poses, since the convergence properties of StefCal have Let us first consider the DI case. The structure of J H J in been thoroughly explored by Salvini & Wijnholds (2014a), Eq. 2.12 suggests that it is diagonally dominant (especially and we can therefore hope to apply these lessons here. In for larger Nant), as each diagonal element is a coherent sum particular, these authors have shown that a direct appli- of Nant amplitude-squared y-terms, while the off-diagonal cation of GN produces very slow (oscillating) convergence, elements are either zero or a product of two y terms. This whereas combining GN and LM leads to faster convergence. is graphically illustrated in Fig. 1(f). It is therefore not un- They also propose a number of variations of the algorithm, reasonable to approximate J H J with a diagonal matrix for all of which are directly applicable to the above. purposes of inversion (or equivalently, making the assump- Finally, let us note in passing that the update step of tion that the problem is separable per antenna): Eq. 3.3 is embarrassingly parallel, in the sense that the up- date for each antenna is computed entirely independently. 
A 0 H˜ = (3.1) 0 A   3.2 Separability of the direction-dependent case This makes the costs of matrix inversion negligible – O(Nant) 2 H operations, as compared to the O(Nant) cost of computing Now consider the problem of inverting J J in the DD case. the diagonal elements of the Jacobian in the first place. The This is a massive matrix, and a brute force approach would Radio interferometric gain calibration as a complex optimization problem 7

Classical Re/Im Wirtinger (a) (e) CDA

(b) (f) CAD

(c) (g) DCA

(d) (h) ACD

Figure 1. A graphical representation of JH J for a case of 40 antennas and 5 directions. Each pixel represents the amplitude of a single matrix element. The left column (a–d) shows conventional real-only Jacobians constructed by taking the partial derivatives w.r.t. the real and imaginary parts of the gains. The ordering of the parameters is (a) real/imaginary major, direction, antenna minor (i.e. antenna changes fastest); (b) real/imaginary, antenna, direction; (c) direction, real/imaginary, antenna; (d) antenna, real/imaginary, direction. The right column (e–h) shows full complex Jacobians with similar parameter ordering (direct/conjugate instead of real/imaginary). Note that panel (f) can also be taken to represent the direction-independent case, if we imagine each 5 × 5 block as one pixel. 8 O.M. Smirnov & C. Tasse

3 3 scale as O(NdirNant). We can, however, adopt a few approx- expressed as imations. Note again that we are free to reorder our aug- (d) (d) (d) mented parameter vector (which contains both g and g¯), as mpqtν = Spqtν kpqtν , (3.13) J H J long as we reorder the rows and columns of accord- where the term S represents the visibility of that sky model ingly. component if placed at phase centre (usually only weakly g˘ Let us consider a number of orderings for : dependent on t,ν – in the case of a source, for example, • conjugate major, direction, antenna minor (CDA): S is just a constant flux term), while the term

(1) (1) (2) (2) (3) (Ndir) (1) (1) T (d) −2πi(upq (t)·σd)ν/c T k = e , σd = [ld,md, nd − 1] , (3.14) [g1 ...gNant ,g1 ...gNant ,g1 ...gNant , g¯1 . . . g¯Nant . . . ] pqtν (3.5) σ • conjugate, antenna, direction (CAD): represents the phase rotation to direction d (where lmn are the corresponding direction cosines), given a baseline vector (1) (Ndir) (1) (Ndir) (1) (Ndir) (1) T as a function of time upq (t). We can then approximate the [g1 ...g1 ,g2 ...g2 ,g3 ...gNant , g¯1 . . . ] (3.6) sky model dot product above as • direction, conjugate, antenna (DCA): (cd) (c) (d) −2πi[upq (t)·(σc−σd)]ν/c Xpq = Spq Spq e (3.15) (1) (1) (1) (1) (2) (2) (2) T [g ...g , g¯ . . . g¯ ,g ...g , g¯ . . . ] (3.7) s 1 Nant 1 Nant 1 Nant 1 X • antenna, conjugate, direction (ACD): The sum over samples s is essentially just an over a complex fringe. We may expect this to be small (i.e. (1) (Ndir) (1) (Ndir) (1) (Ndir) (1) T [g1 ...g1 , g¯1 . . . g¯1 ,g2 ...g2 , g¯2 . . . ] the sky model components to be more orthogonal) if the (3.8) directions are well-separated, and also if the sum is taken Figure 1(e–h) graphically illustrates the structure of the over longer time and frequency intervals. Jacobian under these orderings. If we now assume that the sky model components are At this point we may derive a whole family of DD cal- orthogonal or near-orthogonal, then we may treat the “cross- J H J ibration algorithms – there are many ways to skin a cat. direction” blocks of the matrix in Eq. 3.9 as null. The J H J Each algorithm is defined by picking an ordering for g˘, then problem is then separable by direction, and is approx- examining the corresponding J H J structure and specifying imated by a block-diagonal matrix: an approximate matrix inversion mechanism, then applying 1 J1 0 GN or LM optimization. Let us now work through a couple ˜ .. of examples. H =  .  , (3.16) 0 J Ndir  Ndir    3 3.2.1 DCA: separating by direction The inversion complexity then reduces to O(NdirNant), which, for large numbers of directions, is a huge improve- Let us first consider the DCA ordering (Fig. 1g). The J H J ment on O(N 3 N 3 ). Either GN and LM optimization may term can be split into N × N blocks: dir ant dir dir now be applied. 1 Ndir J1 . . . J1 H . . J J =  . .  , (3.9) 1 Ndir 3.2.2 COHJONES: separating by antenna JN . . . J   dir Ndir    A complementary approach is to separate the problem by where the structure of each 2N × 2N block at row c, ant ant antenna instead. Consider the CAD ordering (Fig. 1f). The column d, is exactly as given by Eq. 2.30 (or by Fig. 1f, in top half of J H J then has the following block structure (and miniature). d the bottom half is its symmetric conjugate): The on-diagonal (“same-direction”) blocks Jd will have 1 1 Nant the same structure as in the DI case (Eq. 2.12 or Eq. A3). A1 0 B1 . . . B d 1 Consider now the off-diagonal (“cross-direction”) blocks Jc . . . H = .. . . , (3.17) Their non-zero elements can take one of two forms: U  . . .  0 ANant B1 . . . BNant (c) (d) (c) (d) (c) (d)  Nant Nant Nant  y¯iqsyiqs = gq g¯q m¯ iqsmiqs (3.10)   q6=i,s q6=i s that is, its left half is block-diagonal, consisting of N ×N X X X dir dir or blocks (which follows from Eq. 2.30), while its right half (c) (d) (c) (d) (c) (d) consists of elements of the form given by Eq. 3.11. yjisyijs =g ¯i g¯j m¯ ijsmijs (3.11) By analogy with the StefCal approach, we may as- s s i X X sume Bj ≈ 0, i.e. treat the problem as separable by an- A common element of both is essentially a dot product tenna. 
The H˜ matrix then becomes block-diagonal, and we i of sky model components. This is a measure of how “non- only need to compute the true matrix inverse of each Ai. 3 orthogonal” the components are: The inversion problem then reduces to O(NdirNant) in com- plexity, and either LM or GN optimization may be applied. (cd) m(c) m(d) (c) (d) Xpq = pq , pq = mpqsm¯ pqs. (3.12) For GN, the update step may be computed in direct s D E X analogy to Eq. 3.3 (noting that Eq. 2.20 holds): We should now note that each model component will typi- ˜ −1 H ˘ cally correspond to a source of limited extent. This can be gk+1 = HULJ L d. (3.18) Radio interferometric gain calibration as a complex optimization problem 9

We may note in passing that for LM, the analogy is only Table 2. The scaling of computational costs for a single iteration approximate: of the four DD calibration algorithms discussed in the text, broken down by computational step. The dominant term(s) in each case λ 1 −1 ˘ ˘ ˜ H ˘ H ˘ gk+1 ≈ gk + HULJ L d, (3.19) are marked by “†”. Not shown is the cost of computing the J d 1+ λ 1+ λ 2 vector, which is O(NantNdir) for all algorithms. “Exact” refers since H˜ is only approximately diagonal. to a naive implementation of GN or LM with exact inversion of H This approach has been implemented as the CohJones the J J term. Scaling laws for DI calibration algorithms may CohJones (complex half-Jacobian optimization for n-directional esti- be obtained by assuming Ndir = 1, in which case or AllJones become equivalent to StefCal. mation) algorithm4, the results of which applied to simu- lated data are presented below. −1 algorithm H˜ H˜ multiply

2 2 3 3 † 2 2 Exact O(NantNdir) O(NantNdir) O(NantNdir) 3.2.3 ALLJONES: Separating all AllJones 2 † O(NantNdir) O(NantNdir) O(NantNdir) CohJones 2 2 † 3 † 2 Perhaps the biggest simplification available is to start with O(NantNdir) O(NantNdir) O(NantNdir) 2 3 † 2 CDA or CAD ordering, and assume a StefCal-style di- DCA O(NantNdir) O(NantNdir) O(NantNdir) agonal approximation for the entirety of J H J. The matrix then becomes purely diagonal, matrix inversion reduces to one or the other algorithm has a computational advantage. O(NdirNant) in complexity, and algorithmic cost becomes 2 This should be investigated in a future work. dominated by the O(NdirNant) process of computing the di- agonal elements of J H J. Note that Eq. 2.20 no longer holds, and the GN update step must be computed via the residuals: 3.4 Smoothing in time and frequency ˜ −1 H ˘ gk+1 = gk + HULJ L r, (3.20) From physical considerations, we know that gains do not vary arbitrarily in frequency and time. It can therefore be with the per-element expression being desirable to impose some sort of constraint on −1 (d) (d) (d) (d) (d) the solutions, which can improve conditioning, especially in gp,k+1 = gp,k + y¯pqsypqs y¯pqsrpqs. (3.21) low-SNR situations. A simple but crude way to do this is  qX6=p,s  qX6=p,s use solution intervals (Sect. 2.3), which gives a constant gain solution per interval, but produces non-physical jumps at the 3.3 Convergence and algorithmic cost edge of each interval. Other approaches include a posteriori smoothing of solutions done on smaller intervals, as well as Evaluating the gain update (e.g. as given by Eq. 2.18) in- various filter-based algorithms (Tasse 2014). volves a number of computational steps: Another way to impose smoothness combines the ideas (i) Computing H˜ ≈ J H J, of solution intervals (Eq. 2.21) and weighting (Eq. 2.24). (ii) Inverting H˜ , At every time/frequency sample t0,ν0, we can postulate a (iii) Computing the J H d˘ vector, weighted LS problem: (iv) Multiplying the result of (ii) and (iii). 2 min w(t − t0,ν − ν0)|rpqtν | , (3.22) g pqtν Since each of the algorithms discussed above uses a dif- X ferent sparse approximation for J H J, each of these steps where w is a smooth weighting kernel that upweighs samples 2 will scale differently (except iii, which is O(NantNdir) for at or near the current sample, and downweighs distant sam- all algorithms). Table 2 summarizes the scaling behaviour. ples (e.g., a 2D Gaussian). The solutions for adjacent sam- An additional scaling factor is given by the number of it- ples will be very close (since they are constrained by prac- erations required to converge. This is harder to quantify. tically the same range of data points, with only a smooth For example, in our experience (Smirnov 2013), an “exact” change in weights), and the degree of smoothness can be (LM) implementation of DI calibration problem converges controlled by tuning the width of the kernel. in much fewer iterations than StefCal (on the order of a On the face of it this approach is very expensive, since few vs. a few tens), but is much slower in terms of “wall it entails an independent LS solution centred at every t0,ν0 3 time” due to the more expensive iterations (Nant vs. Nant sample. The diagonal approximation above, however, allows scaling). This trade-off between “cheap–approximate” and for a particularly elegant and efficient way of implementing “expensive–accurate” is typical for iterative algorithms. this in practice. Consider the weighted equations of Eqs. 
2.26 CohJones accounts for interactions between directions, and 2.27, and replace the sample index s by t,ν. Under the but ignores interactions between antennas. Early experi- diagonal approximation, each parameter update at t0,ν0 is ence indicates that it converges in a few tens of iterations. computed as: AllJones uses the most approximative step of all, ignor- w(t − t0,ν − ν0)¯ypqtν dpqtν ing all interactions between parameters. Its convergence be- q6=p,t,ν gp,k+1(t0,ν0)= . (3.23) haviour is untested at this time. P w(t − t0,ν − ν0)¯ypqtν ypqtν q6=p,t,ν It is clear that depending on Nant and Ndir, and also P on the structure of the problem, there will be regimes where Looking at Eq. 3.23, it’s clear that both sums represent a convolution. If we define two functions of t,ν: 4 CohJones In fact it was the initial development of by Tasse αp(t,ν)= y¯pqtν dpqtν , βp(t,ν)= y¯pqtν ypqtν , (3.24) (2014) that directly led to the present work. Xq6=p Xq6=p 10 O.M. Smirnov & C. Tasse then Eq. 3.23 corresponds to the ratio of two convolutions we can vectorize each matrix equation, turning it into an

w ◦ αp equation on 4-vectors, and then derive the complex Jacobian gp,k+1(t,ν)= , (3.25) w ◦ βp in the usual manner (Eq. 1.7). In this section we will obtain a more elegant derivation, sampled over a discrete t,ν grid. Note that the formula- by employing an operator calculus where the “atomic” ele- tion above also allows for different smoothing kernels per ments are 2 × 2 matrices rather than scalars. This will allow antenna. Iterating Eq. 3.25 to convergence at every t,ν slot, us to define the Jacobian in a more transparent way, as a we obtain per-antenna arrays of gain solutions answering matrix of linear operators on 2×2 matrices. Mathematically, Eq. 3.22. These solutions are smooth in frequency and time, this is completely equivalent to vectorizing the problem and with the degree of smoothness constrained by the kernel w. applying Wirtinger calculus (each matrix then corresponds There is a very efficient way of implementing this in to 4 elements of the parameter vector, and each operator practice. Let’s assume that d and y are loaded into mem- pq pq in the Jacobian becomes a 4 × 4 matrix block). The ca- ory and computed for a large chunk of t,ν values simulta- sual reader may simply take the postulates of the following neously (any practical implementation will probably need section on faith – in particular, that the operator calculus to do this anyway, if only to take advantage of vectorized approach is completely equivalent to using 4 × 4 matrices. math operations on modern CPUs and GPUs). The param- A rigorous formal footing to this is given in Appendix B. eter update step is then also evaluated for a large chunk of t,ν, as are the αp and βp terms. We can then take advantage of highly optimized implementations of convolution (e.g. via 4.1 Matrix operators and derivatives FFTs) that are available on most computing architectures. Smoothing may also be trivially incorporated into the By matrix operator, we shall refer to any function F that AllJones algorithm, since its update step (Eq. 3.21) has ex- maps a 2 × 2 complex matrix to another such matrix: actly the same structure. A different smoothing kernel may F : C2×2 → C2×2. (4.3) be employed per direction (for example, directions further from the phase centre can employ a narrower kernel). When the operator F is applied to matrix X, we’ll write the Since smoothing involves computing a g value at ev- result as Y = FX, or F[X] if we need to avoid ambiguity. ery t,ν point, rather than one value per solution interval, If we fix a complex matrix A, then two interesting (and its computational costs are correspondingly higher. To put linear) matrix operators are right-multiply by A, and left- multiply by A: it another way, using solution intervals of size Nt × Nν in- troduces a savings of Nt × Nν (in terms of the number of RAX = XA (4.4) invert and multiply steps required, see Table 2) over solving LAX = AX the problem at every t,ν slot; using smoothing foregoes these savings. In real implementations, this extra cost is mitigated Appendix B formally shows that all linear matrix operators, by the fact that the computation given by Eq. 3.21 may be including RA and LA, can be represented as multiplication vectorized very efficiently over many t,ν slots. However, this of 4-vectors by 4 × 4 matrices. vectorization is only straightforward because the matrix in- Just from the operator definitions, it is trivial to see version in StefCal or AllJones reduces to simple scalar that −1 division. 
For CohJones or DCA this is no longer the case, so −1 LALB = LAB, RARB = RBA, [RA] = RA (4.5) while smoothing may be incorporated into these algorithms in principle, it is not clear if this can be done efficiently in Consider now a matrix-valued function of n matrices practice. and their Hermitian transposes H H F (G1 . . . Gn, G1 . . . Gn ), (4.6) and think what a consistent definition for the partial matrix 4 THE FULLY POLARIZED CASE derivative ∂F /∂Gi would need be. A at ˘ H H To incorporate polarization, let us start by rewriting the some fixed point G0 = (G1 . . . Gn, G1 . . . Gn ) is a local basic RIME of Eq. 2.1 using 2 × 2 matrices (a full derivation linear approximation to F , i.e. a linear function mapping may be found in Smirnov 2011a): an increment in an argument ∆Gi to an increment in the function value ∆F . In other words, the partial derivative is D = G M GH + N . (4.1) pq p pq q pq a linear matrix operator. Designating this operator as D = Here, Dpq is the visibility matrix observed by baseline pq, ∂F /∂Gi, we can write the approximation as: M pq is the sky coherency matrix, Gp is the Jones matrix F (..., Gi +∆Gi, ...) − F (..., Gi, ...) ≈ D∆Gi. (4.7) associated with antenna p, and N pq is a noise matrix. Quite importantly, the visibility and coherency matrices are Her- Obviously, the linear operator that best approximates a H H mitian: Dpq = Dqp, and M pq = M qp. The basic polariza- given linear operator is the operator itself, so we necessarily tion calibration problem can be formulated as have H GA AGH min ||Rpq||F , Rpq = Dpq − GpM pqGq . (4.2) ∂( ) ∂( ) {Gp} = RA, H = LA. (4.8) pq ∂G ∂G X This is a set of 2 × 2 matrix equations, rather than the Appendix B puts this on a formal footing, by providing vector equations employed in the complex NNLS formalism formal definitions of Wirtinger matrix derivatives above (Eq. 1.4). In principle, there is a straightforward way ∂F ∂F , (4.9) of recasting matrix equations into a form suitable to Eq. 1.4: ∂G ∂GH Radio interferometric gain calibration as a complex optimization problem 11 that are completely equivalent to the partial complex Jaco- We may now make exactly the same observation as we bians defined earlier. did to derive Eq. 2.8, and rewrite both J and R˘ in terms of Note that this calculus also offers a natural way of tak- a single row block. The pq index will now run through 2Nbl ing more complicated matrix derivatives (that is, for more values (i.e. enumerating all combinations of p 6= q): elaborate versions of the RIME). For example, j j J = RM GH δp LGpM pq δq [pq]=1...2Nbl (p6=q) (4.17) ∂(AGB) pq q = LARB , (4.10) h i ∂G and which is a straightforward manifestation of the : ˘ R = Rpq }[pq]=1...2N (p6=q) (4.18) AGB = A(GB). bl This is in complete analogy to the derivations of the unpo- larized case. For compactness, let us now define 4.2 Complex Jacobians for the polarized RIME H H H Y pq = M pq Gq , Y qp = M pqGp (4.19) Let us now apply this operator calculus to Eq. 4.2. Taking the derivatives, we have: noting that H H H ∂V pq ∂V pq Y pq = GqM pq , Y qp = GpM pq . (4.20) H = RM pq G , and H = LGpMpq . (4.11) ∂Gp q G ∂ q Employing Eq. 
B12, the J and J H terms can be written as If we now stack all the gain matrices into one augmented i R H δ “vector of matrices”: H Ypq p j j J = , J = RYpq δp LY H δq (4.21) i qp G˘ H H T "LYqp δq # = [G1,..., GNant , G1 ,..., GNant ] , (4.12) h i H then we may construct the top half of the full complex Jaco- We can now write out the J J term, still expressed in bian operator in full analogy with the derivation of Eq. 2.6. terms of operators, as:

We’ll use the same “compound index” convention for pq. R H L H , i6=j Yij Yji That is, [pq] will represent a single index running through diag RY Y H iq iq 0, i=j M values (i.e. enumerating all combinations of p < q). H  q6=i (  J J = P (4.22) i6=j j=1...Nant j=1...Nant  LYij RYji ,  diag L H  Yiq Yiq   0, i=j q6=i  j j   J U = R H δ LG M δ }[pq]=1...N (p

j j is met. R H δ LG M δ }[pq]=1...Nbl (p

The brute-force numerical approach is to substitute the Equations 4.27–4.28 can be considered the principal re- 4 × 4 matrix forms of the R and L operators (Eqs. B10 sult of this work. They provide the necessary ingredients and B11) into the equation, thus resulting in conventional for implementing GN or LM methods for DD calibration, matrix, and then do a straightforward matrix inversion. treating it as a fully complex optimization problem. The The second option is to use an analogue of the diago- equations may be combined and approximated in different nal approximation described in Sect. 3.1. If we neglect the ways to produce different types of calibration algorithms. off-diagonal operators of Eq. 4.22, the operator form of the Another interesting note is that Eq. 4.29 and its ilk are matrix is diagonal, i.e. the problem is again treated as being embarrassingly parallel, since the update step is completely separable per antenna. As for the operators on the diagonal, separated by direction and antenna. This makes it partic- they are trivially invertible as per Eq. 4.5. We can there- ularly well-suited to implementation on massively parallel fore directly invert the operator form of J H J and apply it architectures such as GPUs. to J H R˘ , thus arriving at a simple per-antenna equation for the GN update step: −1 H H Gp,k+1 = Y pqDpq Y pq Y pq (4.26)     5 OTHER DD ALGORITHMIC VARIATIONS Xq6=p Xq6=p This is, once again, equivalent  to the polarized Ste- The mathematical framework developed above (in particu- fCal update step proposed by Smirnov (2013) and lar, Eqs. 4.27–4.28) provides a general description of the po- Salvini & Wijnholds (2014a). larized DD calibration problem. Practical implementations of this hinge around inversion of a very large J H J matrix. The family of algorithms proposed in Sect. 3 takes different 4.4 Polarized direction-dependent calibration approaches to approximating this inversion. Their conver- Let us now briefly address the fully-polarized DD case. This gence properties are not yet well-understood; however we StefCal can be done by direct analogy with Sect. 2.5, using the op- may note that the algorithm naturally emerges erator calculus developed above. As a result, we arrive at from this formulation as a specific case, and its convergence the following expression for J H J: has been established by Salvini & Wijnholds (2014a). This is encouraging, but ought not be treated as anything more R (c)H L (d)H than a strong pointer for the DD case. It is therefore well i Y ijs Y jis δ R (d) (c)H s j Y Y worth exploring other approximations to the problem. In  q6=i,s iqs iqs (P 0  P (4.27) this section we out a few such options.  L (c) R (d) , i6=j  Y Y i  s ijs jis c d H   δj LY ( ) Y ( )  (P 0 i=j q6=i,s iqs iqs   P   (d) (d) (d)H using the normal shorthand of Y pqs = M pqsGq . The 5.1 Feed forward J H R˘ term is then Ste- (d)H Salvini & Wijnholds (2014b) propose variants of the Y iqs Riqs fCal algorithm (“2-basic” and “2-relax”) where the results J H R˘ = q6=i,s . (4.28)   of the update step (Eq. 4.26, in essence) are computed se- P ↓H quentially per antenna, with updated values for G1 . . . Gk−1   All the separability considerations of Sect. 3 now apply, and fed forward into the equations for Gk (via the appropriate polarized versions of the algorithms referenced therein may Y terms). This is shown to substantially improve conver- be reformulated for the fully polarized case. 
For example: gence, at the cost of sacrificing the embarrassing parallelism by antenna. This technique is directly applicable to both the • If we assume separability by both direction and an- AllJones and CohJones algorithms. AllJones ˜ tenna, as in the algorithm, then the H matrix The CohJones algorithm considers all directions si- is fully diagonal in operator form, and the GN update step multaneously, but could still implement feed-forward by an- can be computed as tenna. The AllJones algorithm (Eq. 4.29) could implement −1 feed-forward by both antenna (via Y ) and by direction – by (d) (d)H (d) (d)H recomputing the residuals R to take into account the up- δG = Y pqs Rpqs Y pqsY pqs . p,k+1     (1) (d−1) q6=p,s q6=p,s dated solutions for G . . . G before evaluating the so- X X (d)     (4.29) lution for G . The optimal order for this, as well as whether Note that in this case (as in the unpolarized AllJones ver- in practice this actually improves convergence to justify the sion) the condition of Eq. 2.20 is not met, so we must use extra complexity, is an open issue that remains to be inves- the residuals and compute δG. tigated. • If we only assume separability by antenna, as in the ˜ CohJones algorithm, then the H matrix becomes 4Ndir × 4Ndir-block-diagonal, and may be inverted exactly at a cost 3 of O(NdirNant). The condition of Eq. 2.20 is met. 5.2 Triangular approximation It is also straightforward to add weights and/or sliding The main idea of feed-forward is to take into account so- window averaging to this formulation, as per Sect. 2.4 and lutions for antennas (and/or directions) 1, ..., k − 1 when 3.4. computing the solution for k. A related approach is to ap- Radio interferometric gain calibration as a complex optimization problem 13 proximate the J H J matrix as block-triangular: Amplitude Phase

1 J1 0 · · · 0 J 1 J 2 · · · 0 ˜ 2 2 H =  . . .  , (5.1) . .. .   J 1 J 2 ··· J N   N N N    The inverse of this is also block triangular:

1 K1 0 · · · 0 K1 K2 · · · 0 ˜ −1 2 2 H =  . . .  , (5.2) . .. . Figure 3. Amplitude (left panel) and phase (right panel) of the   H K1 K2 ··· KN  block-diagonal matrix (J J)UL for the dataset described in the  N N N    text. Each block corresponds to one antenna; the pixels within a which can be computed using Gaussian elimination: block correspond to directions.

1 1 −1 K1 = [J1 ] 2 2 −1 K2 = [J2 ] 1 2 1 1 K2 = −K2J2 K1 3 3 −1 K3 = [J3 ] (5.3) 2 3 2 2 K3 = −K3J3 K2 0.04 1 3 1 1 2 1 K3 = −K3[J3 K1 + J3 K2] · · ·

From this, the GN or LM update steps may be derived di- 0.02 rectly.

0.00

5.3 Peeling m [radian]

The peeling procedure was originally suggested by Noordam 0.02 (2004) as a “kludge”, i.e. an implementation of DD calibra- tion using the DI functionality of existing packages. In a nutshell, this procedure solves for DD gains towards one 0.04 source at a time, from brighter to fainter, by 0.03 0.02 0.01 0.00 0.01 0.02 0.03 (i) Rephasing the visibilities to place the source at phase l [radian] centre; (ii) Averaging over some time/frequency interval (to sup- Figure 5. In order to conduct direction-dependent calibration, press the contribution of other sources); sources are clustered using a Voronoi tessellation algorithm. Each (iii) Doing a standard solution for DI gains (which ap- cluster has its own DD gain solution. proximates the DD gains towards the source); (iv) Subtracting the source from the visibilities using the obtained solutions; 5.4 Exact matrix inversion (v) Repeating the procedure for the next source. Better approximations to (J H J)−1 (or a faster exact in- The term “peeling” comes from step (iv), since sources verse) may exist. Consider, for example, Fig. 1f: the matrix are “peeled” away one at a time5. consists of four blocks, with the diagonal blocks being triv- ially invertible, and the off-diagonal blocks having a very Within the framework above, peeling can be considered as the ultimate feed forward approach. Peeling is essentially specific structure. All the approaches discussed in this pa- feed-forward by direction, except rather than taking one step per approximate the off-diagonal blocks by zero, and thus yield algorithms which converge to the solution via many over each direction in turn, each direction is iterated to full cheap approximative steps. If a fast way to invert matrices convergence before moving on to the next direction. The 3 procedure can then be repeated beginning with the bright- of the off-diagonal type (faster than O(N ), that is) could be found, this could yield calibration algorithms that converge est source again, since a second cycle tends to improve the solutions. in fewer more accurate iterations.

6 IMPLEMENTATIONS 5 The term “peeling” has occasionally been misappropriated to describe other schemes, e.g. simultaneous independent DD gain 6.1 StefCal in MeqTrees solutions. We consider this a misuse: both the original formulation by Noordam (2004), and the word “peeling” itself, strongly implies Some of the ideas above have already been implemented in dealing with one direction at a time. the MeqTrees (Noordam & Smirnov 2010) version of Ste- 14 O.M. Smirnov & C. Tasse

Direction 1 Direction 2 Direction 3 Direction 4 Direction 5 101 101 101 101 101 100 100 100 100 100 10-1 10-1 10-1 10-1 10-1 10-2 10-2 10-2 10-2 10-2 10-3 10-3 10-3 10-3 10-3 10-4 10-4 10-4 10-4 10-4 10-5 10-5 10-5 10-5 10-5 Residual Amplitude Residual 10-6 10-6 10-6 10-6 10-6 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50

101 101 101 101 101 100 100 100 100 100 10-1 10-1 10-1 10-1 10-1 10-2 10-2 10-2 10-2 10-2 10-3 10-3 10-3 10-3 10-3 10-4 10-4 10-4 10-4 10-4

Residual Phase Residual 10-5 10-5 10-5 10-5 10-5 10-6 10-6 10-6 10-6 10-6 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 Iteration Iteration Iteration Iteration Iteration

Figure 2. Amplitude (top row) and phase (bottom row) of the difference between the estimated and true gains, as a function of iteration. Columns correspond to directions. Different lines correspond to different antennas.

Deconvolved image DI Substracted DDE Substracted

1.0 1.0 1.0

0.5 0.5 0.5

0.0 0.0 0.0

0.5 0.5 0.5

1.0 1.0 1.0 Distance to the field center [deg]

1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 Distance to the field center [deg] Distance to the field center [deg] Distance to the field center [deg]

Figure 4. Simulation with time-variable DD gains. We show a deconvolved image (left) where no DD solutions have been applied, a residual image (centre) made by subtracting the sky model (in the visibility plane) without any DD corrections, and a residual image (right) made by subtracting the sky model with CohJones-estimated DD gain solutions (right). The color scale is the same in all panels. In this simulation, applying CohJones for DD calibration reduces the residual rms level by a factor of ∼ 4. fCal (Smirnov 2013). In particular, the MeqTrees version ment Sets. This section reports on tests of our implementa- uses peeling (Sect. 5.3) to deal with DD solutions, and im- tion with simulated Low Frequency Array (LOFAR) data. plements fully polarized StefCal with support for both so- For the tests, we build a dataset using a LOFAR layout lution intervals and time/frequency smoothing with a Gaus- with 40 antennas. The phase center is located at δ = +52◦, sian kernel (as per Sect. 3.4). This has already been applied the observing frequency is set to 50 MHz (single channel), to JVLA L-band data to obtain what is (at time of writing) and the integrations are 10s. We simulate 20 minutes of data. a world record dynamic range (3.2 million) image of the field For the first test, we use constant direction-dependent around 3C147 (Perley 2013). gains. We then run CohJones with a single solution inter- val corresponding to the entire 20 minutes. This scenario is essentially just a test of convergence. For the second test, CohJones 6.2 tests with simulated data we simulate a physically realistic time-variable ionosphere The CohJones algorithm, in the unpolarized version, has to derive the simulated DD gains. been implemented as a standalone Python script that uses the pyrap6 and casacore7 libraries to interface to Measure- 6.2.1 Constant DD gains

6.2.1 Constant DD gains

To generate the visibilities for this test, we use a sky model containing five sources in a “+” shape, separated by 1°.

The gains for each antenna p and direction d are constant in time, and are drawn at random from a complex normal distribution, g_p^(d) ~ N(0,1) + iN(0,1). The data vector d˘ is then built from all baselines and the full 20 minutes of data. The solution interval is set to the full 20 minutes, so a single solution per direction, per antenna is obtained.

The corresponding matrix (J^H J)_UL is shown in Fig. 3. It is block-diagonal, each block having size N_dir × N_dir. The convergence of the gain solutions as a function of direction is shown in Fig. 2. It is important to note that the problem becomes better conditioned (and CohJones converges faster) as the blocks of (J^H J)_UL become more diagonally dominant (equivalently, as the sky model components become more orthogonal). As discussed in Sect. 3.2.1, this happens (i) when more visibilities are taken into account (larger solution intervals), or (ii) if the directions are further away from each other.
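This setup is straightforward to reproduce. The following numpy sketch is our own illustration: the array sizes are those of the test, but the per-direction model coherencies m are random stand-ins for the actual five-source sky model.

```python
# Sketch of the constant-DD-gain simulation: per-antenna, per-direction
# gains g[p,d] ~ N(0,1) + iN(0,1), and model visibilities
# v_pq = sum_d g_p^(d) m_pq^(d) conj(g_q^(d)).
import numpy as np

rng = np.random.default_rng(0)
n_ant, n_dir = 40, 5

g = rng.standard_normal((n_ant, n_dir)) \
    + 1j * rng.standard_normal((n_ant, n_dir))

# Hypothetical model coherencies m[p,q,d], made Hermitian in (p,q)
m = rng.standard_normal((n_ant, n_ant, n_dir)) \
    + 1j * rng.standard_normal((n_ant, n_ant, n_dir))
m = 0.5 * (m + m.conj().transpose(1, 0, 2))

# v[p,q] = sum over directions of g_p m_pq conj(g_q)
v = np.einsum("pd,pqd,qd->pq", g, m, g.conj())

# The data vector d˘ stacks all baselines and their conjugates; here we
# simply check the Hermitian symmetry v_qp = conj(v_pq).
assert np.allclose(v, v.conj().T)
```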
Figure 6. The complex phases of the DD gain terms (for all antennas and a single direction) derived from the time-variable TEC screen used in Sect. 6.2.2.

6.2.2 Time-variable DD gains

To simulate a more realistic dataset, we use a sky model composed of 100 point sources of random (uniformly distributed) flux density. We also add noise to the visibilities, at a level of about 1% of the total flux. We simulate (scalar, phase-only) DD gains, using an ionospheric model consisting of a simple phase screen (an infinitesimally thin layer at a height of 100 km). The total electron content (TEC) values at the set of sample points are generated using a Karhunen-Loève decomposition (the spatial correlation is given by Kolmogorov turbulence; see van der Tol 2009). The constructed TEC screen has an amplitude of ∼0.07 TEC-units, and the corresponding DD phase terms are plotted in Fig. 6.

For calibration purposes, the sources are clustered into 10 directions using Voronoi tessellation (Fig. 5). The solution time-interval is set to 4 minutes, and a separate gain solution is obtained for each direction. Fig. 4 shows images generated from the residual visibilities, where the best-fitting model is subtracted in the visibility domain. The rms residuals after CohJones has been applied are a factor of ∼4 lower than without DD solutions.

7 CONCLUSIONS

Recent developments in optimization theory have extended traditional NLLS optimization approaches to functions of complex variables. We have applied this to radio interferometric gain calibration, and shown that the use of complex Jacobians allows for new insights into the problem, leading to the formulation of a whole new family of DI and DD calibration algorithms. These algorithms hinge around different sparse approximations of the J^H J matrix; we show that some recent algorithmic developments, notably StefCal, naturally fit into this framework as a particular special case of sparse (specifically, diagonal) approximation.

The proposed algorithms have different scaling properties depending on the selected matrix approximation – in all cases better than the cubic scaling of brute-force GN or LM methods – and may therefore exhibit different computational advantages depending on the dimensionality of the problem (number of antennas, number of directions). We also demonstrate an implementation of one particular algorithm for DD gain calibration, CohJones.

The use of complex Jacobians results in relatively compact and simple equations, and the resulting algorithms tend to be embarrassingly parallel, which makes them particularly amenable to implementation on new massively-parallel computing architectures such as GPUs. This flexibility is particularly important for addressing the computational needs of the new generation of so-called “SKA pathfinder” telescopes, as well as the SKA itself.

Complex optimization is applicable to a broader range of problems. Solving for a large number of independent DD gain parameters is not always desirable, as it potentially makes the problem under-constrained, and can lead to artefacts such as ghosts and source suppression. The alternative is solving for DD effect models that employ a smaller set of physical parameters, such as parameters of the primary beam and ionosphere. If these parameters are complex, then the complex Jacobian approach applies. Finally, although this paper only treats the NLLS problem (thus implicitly assuming Gaussian statistics), the approach is valid for the general optimization problem as well.

Other approximations or fast ways of inverting the complex J^H J matrix may exist, and future work can potentially yield new and faster algorithms within the same unifying mathematical framework.

ACKNOWLEDGMENTS

We would like to thank the referee, Johan Hamaker, for extremely valuable comments that improved the paper. This work is based upon research supported by the South African Research Chairs Initiative of the Department of Science and Technology and National Research Foundation. Trienko Grobler originally pointed us in the direction of Wirtinger derivatives.

REFERENCES

Hamaker J. P., 2000, A&AS, 143, 515
Hamaker J. P., Bregman J. D., Sault R. J., 1996, A&AS, 117, 137

Kazemi S., Yatawatta S., 2013, MNRAS, 435, 597

Kreutz-Delgado K., 2009, arXiv:0906.4835
Laurent S., van Barel M., de Lathauwer L., 2012, SIAM J. Optim., 22, 879
Madsen K., Nielsen H. B., Tingleff O., 2004, Methods for Non-linear Least Squares Problems. Informatics and Mathematical Modelling, Technical University of Denmark
Mitchell D. A., Greenhill L. J., Wayth R. B., Sault R. J., Lonsdale C. J., Cappallo R. J., Morales M. F., Ord S. M., 2008, IEEE Journal of Selected Topics in Signal Processing, 2, 707
Noordam J. E., 2004, in Oschmann J. J. M., ed., Ground-based Telescopes, Vol. 5489 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, LOFAR calibration challenges. p. 817
Noordam J. E., Smirnov O. M., 2010, A&A, 524, A61
Perley R., 2013, High Dynamic Range Imaging, presentation at “The Radio Universe @ Ger’s (wave)-length” conference (Groningen, November 2013), http://www.astron.nl/gerfeest/presentations/perley.pdf
Salvini S., Wijnholds S. J., 2014a, A&A, 571, A97
Salvini S., Wijnholds S. J., 2014b, in General Assembly and Scientific Symposium (URSI GASS), 2014 XXXIth URSI, Stefcal – an alternating direction implicit method for fast full polarization array calibration. pp 1–4
Smirnov O. M., 2011a, A&A, 527, A106
Smirnov O. M., 2011b, A&A, 527, A107
Smirnov O. M., 2011c, A&A, 527, A108
Smirnov O. M., 2013, StefCal: the fastest selfcal in the West, presentation at 3GC3 workshop (Port Alfred, February 2013), http://tinyurl.com/pzu8hco
Tasse C., 2014, arXiv:1410.8706
Tasse C., 2014, A&A, 566, A127
van der Tol S., 2009, PhD thesis, TU Delft
Wirtinger W., 1927, Mathematische Annalen, 97, 357

APPENDIX A: J AND J^H J FOR THE THREE-ANTENNA CASE

To give a specific example of complex Jacobians, consider the 3-antenna case. Using the numbering convention for pq of 12, 13, 23, we obtain the following partial Jacobians (Eqs. 2.4 and 2.5):

$$
J_k = \begin{pmatrix} m_{12}\bar g_2 & 0 & 0\\ m_{13}\bar g_3 & 0 & 0\\ 0 & m_{23}\bar g_3 & 0 \end{pmatrix},
\qquad
J_{k^*} = \begin{pmatrix} 0 & g_1 m_{12} & 0\\ 0 & 0 & g_1 m_{13}\\ 0 & 0 & g_2 m_{23} \end{pmatrix}
\tag{A1}
$$

We then get the following expression for the full complex Jacobian J (Eq. 2.8), with rows ordered as pq = 12, 13, 21, 23, 31, 32:

$$
J = \begin{pmatrix}
m_{12}\bar g_2 & 0 & 0 & 0 & g_1 m_{12} & 0\\
m_{13}\bar g_3 & 0 & 0 & 0 & 0 & g_1 m_{13}\\
0 & m_{21}\bar g_1 & 0 & g_2 m_{21} & 0 & 0\\
0 & m_{23}\bar g_3 & 0 & 0 & 0 & g_2 m_{23}\\
0 & 0 & m_{31}\bar g_1 & g_3 m_{31} & 0 & 0\\
0 & 0 & m_{32}\bar g_2 & 0 & g_3 m_{32} & 0
\end{pmatrix}
\tag{A2}
$$

Then, with the usual shorthand of y_pq = m_pq ḡ_q, the J^H J term becomes:

$$
J^H J = \begin{pmatrix}
|y_{12}|^2{+}|y_{13}|^2 & 0 & 0 & 0 & \bar y_{12}\bar y_{21} & \bar y_{13}\bar y_{31}\\
0 & |y_{21}|^2{+}|y_{23}|^2 & 0 & \bar y_{12}\bar y_{21} & 0 & \bar y_{23}\bar y_{32}\\
0 & 0 & |y_{31}|^2{+}|y_{32}|^2 & \bar y_{13}\bar y_{31} & \bar y_{23}\bar y_{32} & 0\\
0 & y_{12}y_{21} & y_{13}y_{31} & |y_{12}|^2{+}|y_{13}|^2 & 0 & 0\\
y_{12}y_{21} & 0 & y_{23}y_{32} & 0 & |y_{21}|^2{+}|y_{23}|^2 & 0\\
y_{13}y_{31} & y_{23}y_{32} & 0 & 0 & 0 & |y_{31}|^2{+}|y_{32}|^2
\end{pmatrix}
\tag{A3}
$$

Finally, the 3-antenna J^H r̆ term becomes:

$$
J^H \breve r = \begin{pmatrix}
\bar y_{12} r_{12} + \bar y_{13} r_{13}\\
\bar y_{21} r_{21} + \bar y_{23} r_{23}\\
\bar y_{31} r_{31} + \bar y_{32} r_{32}\\
y_{12} \bar r_{12} + y_{13} \bar r_{13}\\
y_{21} \bar r_{21} + y_{23} \bar r_{23}\\
y_{31} \bar r_{31} + y_{32} \bar r_{32}
\end{pmatrix}
\tag{A4}
$$
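These expressions are easy to verify numerically. The following sketch (our own check, not part of the original derivation) builds J per Eq. A2 from random gains and a random Hermitian model, and confirms the sparsity structure of Eq. A3:

```python
# Numerical check of the 3-antenna case (Eqs. A2-A3): the upper-left and
# lower-right 3x3 blocks of J^H J are diagonal, with |y|^2 sums on the
# diagonal, where y_pq = m_pq * conj(g_q).
import numpy as np

rng = np.random.default_rng(1)
g = rng.standard_normal(3) + 1j * rng.standard_normal(3)
m = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
m = 0.5 * (m + m.conj().T)                 # enforce m_qp = conj(m_pq)

y = m * g.conj()[None, :]                  # y[p,q] = m_pq conj(g_q)

baselines = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
J = np.zeros((6, 6), dtype=complex)
for row, (p, q) in enumerate(baselines):
    J[row, p] = y[p, q]                    # d v_pq / d g_p
    J[row, 3 + q] = g[p] * m[p, q]         # d v_pq / d conj(g_q)

JHJ = J.conj().T @ J
assert np.allclose(JHJ[:3, :3], np.diag(np.diag(JHJ[:3, :3])))
assert np.allclose(JHJ[3:, 3:], np.diag(np.diag(JHJ[3:, 3:])))
# e.g. the (1,1) element is |y_12|^2 + |y_13|^2:
assert np.isclose(JHJ[0, 0], abs(y[0, 1])**2 + abs(y[0, 2])**2)
```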

APPENDIX B: OPERATOR CALCULUS

First, let us introduce the vectorization operator “vec” and its inverse in the usual (stacked columns) way^8. For a 2 × 2 matrix X:

$$
\mathrm{vec}\,X = (x_{11},\,x_{21},\,x_{12},\,x_{22})^T, \qquad \mathrm{vec}^{-1}(x_{11},\,x_{21},\,x_{12},\,x_{22})^T = X,
\tag{B1}
$$

which sets up an isomorphism between the space of 2 × 2 complex matrices C^{2×2} and the space C^4. Note that the “vec” operator is linear; in other words, the isomorphism preserves linear structure,

$$ \mathrm{vec}\,(X + aY) = \mathrm{vec}\,X + a\,\mathrm{vec}\,Y, \tag{B2} $$

as well as the Frobenius norm:

$$ ||\mathrm{vec}\,X||_F = ||X||_F. \tag{B3} $$

Consider now the set of all linear operators on C^{2×2}, or Lin(C^{2×2}, C^{2×2}). Any such linear operator $\mathcal{B}$, whose action we'll write as $\mathcal{B}X$, can be associated with a linear operator on 4-vectors $\mathbf{B}$ ∈ Lin(C^4, C^4), by defining $\mathbf{B}$ as

$$ \mathbf{B}x = \mathrm{vec}\,\mathcal{B}X, \qquad X = \mathrm{vec}^{-1} x. \tag{B4} $$

Conversely, any linear operator on 4-vectors $\mathbf{B}$ can be associated with a linear operator on 2 × 2 matrices by defining

$$ \mathcal{B}X = \mathrm{vec}^{-1}\,\mathbf{B}x, \qquad x = \mathrm{vec}\,X. \tag{B5} $$

Now, the set Lin(C^4, C^4) is simply the set of all 4 × 4 matrix multipliers. Equations B4 and B5 establish a one-to-one mapping between this set and the set of linear operators on 2 × 2 matrices. In other words, the “vec” operator induces two isomorphisms: one between C^4 and C^{2×2}, and the other between C^{4×4} and linear operators on C^{2×2}. We will designate the second isomorphism by the symbol $\mathcal{W}$:

$$
\mathbf{B} = \mathcal{W}(\mathcal{B}): \quad \mathbf{B}x = \mathrm{vec}\,(\mathcal{B}\,\mathrm{vec}^{-1} x), \qquad
\mathcal{B} = \mathcal{W}^{-1}(\mathbf{B}): \quad \mathcal{B}X = \mathrm{vec}^{-1}(\mathbf{B}\,\mathrm{vec}\,X).
\tag{B6}
$$

Note that $\mathcal{W}$ also preserves linear structure.

8 Note that Hamaker (2000) employs a similar formalism, but uses the (non-canonical) stacked-rows definition instead.
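In numpy terms, “vec” is just column-major flattening, and the one-to-one nature of the W mapping can be demonstrated by applying an operator to the four basis matrices of C^{2×2}. A minimal sketch (our illustration):

```python
# Illustration of Eqs. B1-B6: the vec isomorphism and the W mapping
# between 4x4 matrices and linear operators on 2x2 matrices.
import numpy as np

vec = lambda X: X.flatten(order="F")            # stacked columns (B1)
unvec = lambda x: x.reshape(2, 2, order="F")    # its inverse

rng = np.random.default_rng(2)
X = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
Y = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))

# Linearity (B2) and Frobenius norm preservation (B3):
assert np.allclose(vec(X + 2j * Y), vec(X) + 2j * vec(Y))
assert np.isclose(np.linalg.norm(vec(X)), np.linalg.norm(X, "fro"))

# W^{-1}: a 4x4 matrix BB acts on 2x2 matrices via X -> unvec(BB vec X).
BB = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
op = lambda X: unvec(BB @ vec(X))

# W: recover BB from the operator by applying it to the basis of C^{2x2}.
E = np.eye(4)
BB_rec = np.column_stack([vec(op(unvec(E[:, j]))) for j in range(4)])
assert np.allclose(BB_rec, BB)
```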

Of particular interest to us are two linear operators on C^{2×2}: right-multiplication by some 2 × 2 complex matrix A, and left-multiplication by A:

$$ \mathcal{R}_A X = XA, \qquad \mathcal{L}_A X = AX. \tag{B7} $$

The $\mathcal{W}$ isomorphism ensures that these operators can be represented as multiplication of 4-vectors by specific kinds of 4 × 4 matrices. The matrix outer product proves to be useful here, and in particular the following basic relation:

$$ \mathrm{vec}\,(ABC) = (C^T \otimes A)\,\mathrm{vec}\,B, \tag{B8} $$

from which we can derive the matrix operator forms via

$$ \mathrm{vec}\,(IXA) = (A^T \otimes I)\,\mathrm{vec}\,X, \qquad \mathrm{vec}\,(AXI) = (I \otimes A)\,\mathrm{vec}\,X, \tag{B9} $$

which gives us

$$
\mathcal{W}\mathcal{R}_A = A^T \otimes I = \begin{pmatrix} a_{11} & 0 & a_{21} & 0\\ 0 & a_{11} & 0 & a_{21}\\ a_{12} & 0 & a_{22} & 0\\ 0 & a_{12} & 0 & a_{22} \end{pmatrix}
\tag{B10}
$$

and

$$
\mathcal{W}\mathcal{L}_A = I \otimes A = \begin{pmatrix} a_{11} & a_{12} & 0 & 0\\ a_{21} & a_{22} & 0 & 0\\ 0 & 0 & a_{11} & a_{12}\\ 0 & 0 & a_{21} & a_{22} \end{pmatrix}.
\tag{B11}
$$

From this we get the important property that

$$ [\mathcal{W}\mathcal{R}_A]^H = \mathcal{W}\mathcal{R}_{A^H}, \qquad [\mathcal{W}\mathcal{L}_A]^H = \mathcal{W}\mathcal{L}_{A^H}. \tag{B12} $$

Note that Eq. 4.5, which we earlier derived from the operator definitions, can now be verified with the 4 × 4 forms. Note also that Eq. 4.5 is equally valid whether interpreted in terms of chaining operators, or multiplying the equivalent 4 × 4 matrices.
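Eqs. B8-B12 can be checked directly with np.kron; a short sketch:

```python
# Check of Eqs. B8-B12: R_A and L_A correspond, under W, to the 4x4
# matrices A^T kron I and I kron A, and Hermitian transposition of the
# 4x4 forms corresponds to replacing A by A^H.
import numpy as np

vec = lambda X: X.flatten(order="F")            # column-stacking vec
rng = np.random.default_rng(3)
A = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
B = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
C = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
X = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))

# Basic relation (B8):
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

RA = np.kron(A.T, np.eye(2))                    # W(R_A), Eq. B10
LA = np.kron(np.eye(2), A)                      # W(L_A), Eq. B11
assert np.allclose(vec(X @ A), RA @ vec(X))     # R_A X = X A
assert np.allclose(vec(A @ X), LA @ vec(X))     # L_A X = A X

# Hermitian-transpose property (B12):
AH = A.conj().T
assert np.allclose(RA.conj().T, np.kron(AH.T, np.eye(2)))   # W(R_{A^H})
assert np.allclose(LA.conj().T, np.kron(np.eye(2), AH))     # W(L_{A^H})
```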

B1 Derivative operators and Jacobians

Consider a matrix-valued function of a matrix argument and its Hermitian transpose, F(G, G^H). Yet again, we can employ the “vec” operator to construct a one-to-one mapping between such functions and 4-vector-valued functions of 4-vectors:

$$ f(g, \bar g) = \mathrm{vec}\,F(\mathrm{vec}^{-1} g,\ (\mathrm{vec}^{-1} \bar g)^T). \tag{B13} $$

Consider now the partial and conjugate partial Jacobians of f with respect to g and ḡ, defined as per the formalism of Sect. 1. These are 4 × 4 matrices, as given by Eq. 2.4,

$$ J_k = [\partial f_i/\partial g_j], \qquad J_{k^*} = [\partial f_i/\partial \bar g_j], \tag{B14} $$

that represent local linear approximations to f, i.e. linear operators on C^4 that map increments in the arguments ∆g and ∆ḡ to increments in the function value ∆f. The $\mathcal{W}$ isomorphism defined above matches these operators to linear operators on C^{2×2} that represent linear approximations to F. It is the latter operators that we shall call the Wirtinger matrix derivatives of F with respect to G and G^H:

$$ \frac{\partial F}{\partial G} = \mathcal{W}^{-1}(J_k), \qquad \frac{\partial F}{\partial G^H} = \mathcal{W}^{-1}(J_{k^*}^T). \tag{B15} $$

This is more than just a formal definition: thanks to the $\mathcal{W}$ isomorphism, the operators given by Eq. B15 are Wirtinger derivatives in exactly the same sense that the Jacobians of Eq. B14 are Wirtinger derivatives, the former being simply the C^{2×2} manifestation of the gradient operators in C^4, as defined in Sect. 1. However, operating in C^{2×2} space allows us to write the larger Jacobians of Sect. 4 in terms of simpler matrices composed of operators, resulting in a Jacobian structure that is entirely analogous to the scalar derivation.

APPENDIX C: GRADIENT-BASED OPTIMIZATION ALGORITHMS

This appendix documents the various standard least-squares optimization algorithms that are referenced in this paper.

C1 Algorithm SD (steepest descent)

(i) Start with a best guess for the parameter vector, z_0;
(ii) At each step k, compute the residuals r̆_k and the Jacobian J = J(z̆_k);
(iii) Compute the parameter update as (note that, due to redundancy, only the top half of the vector actually needs to be computed)

$$ \delta\breve{z}_k = -\lambda J^H \breve{r}_k, \tag{C1} $$

where λ is some small value;
(iv) If not converged^9, set z_{k+1} = z_k + δz, and go back to step (ii).

9 See below (Sect. C4).

C2 Algorithm GN (Gauss-Newton)

(i) Start with a best guess for the parameter vector, z_0;
(ii) At each step k, compute the residuals r̆_k and the Jacobian J = J(z̆_k);
(iii) Compute the parameter update δz̆ using Eq. 1.9 with λ = 0 (note that only the top half of the vector actually needs to be computed);
(iv) If not converged, set z_{k+1} = z_k + δz, and go back to step (ii).

C3 Algorithm LM (Levenberg-Marquardt)

Several variations of this exist, but a typical one is (see the sketch following this list):

(i) Start with a best guess for the parameter vector, z_0, and an initial value for the damping parameter, e.g. λ = 1;
(ii) At each step k, compute the residuals r̆_k and the cost function χ²_k = ||r̆_k||²_F;
(iii) If χ²_k > χ²_{k−1} (unsuccessful step), reset z_k = z_{k−1} and set λ = λK (where typically K = 10);
(iv) Otherwise (successful step), set λ = λ/K;
(v) Compute the Jacobian J = J(z̆_k);
(vi) Compute the parameter update δz̆_k using Eq. 1.9 (note that only the top half of the vector actually needs to be computed);
(vii) If not converged, set z_{k+1} = z_k + δz, and go back to step (ii).
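As a concrete illustration of C3, the following is a compact numpy rendering of the LM loop. We assume the standard damped form (J^H J + λ diag J^H J) δz̆ = J^H r̆ as a stand-in for Eq. 1.9 (whose exact damping convention is defined in the main text); residuals() and jacobian() are problem-specific callables returning the augmented residual vector r̆(z) and the complex Jacobian J(z̆).

```python
# Sketch of the LM iteration of Sect. C3, complex (augmented) version.
# The damped update (J^H J + lam * diag(J^H J))^{-1} J^H r is assumed
# as the concrete form of Eq. 1.9.
import numpy as np

def levenberg_marquardt(z0, residuals, jacobian, K=10.0, lam=1.0,
                        niter=50, dz_tol=1e-8):
    z = z_prev = z0.astype(complex)
    chi2_prev = np.inf
    n = len(z0)                                 # number of complex parameters
    for k in range(niter):
        chi2 = np.linalg.norm(residuals(z)) ** 2
        if chi2 > chi2_prev:                    # (iii) unsuccessful step:
            z, chi2, lam = z_prev, chi2_prev, lam * K
        else:                                   # (iv) successful step
            lam = lam / K
        J = jacobian(z)                         # (v) complex Jacobian J(z˘)
        JHJ = J.conj().T @ J
        JHr = J.conj().T @ residuals(z)
        # (vi) damped update; only the top half of dz˘ is needed:
        dz = np.linalg.solve(JHJ + lam * np.diag(np.diag(JHJ)), JHr)[:n]
        z_prev, chi2_prev = z.copy(), chi2
        z = z + dz
        if np.linalg.norm(dz) < dz_tol:         # (vii) converged (Sect. C4)
            break
    return z
```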

C4 Convergence

All of the above algorithms iterate to “convergence”. One or more of the following convergence criteria may be implemented in each case (they combine into a single predicate, as sketched after this list):

• Parameter update smaller than some pre-defined threshold: ||δz||_F < δ_0.
• Improvement in the cost function smaller than some pre-defined threshold: χ²_{k−1} − χ²_k < ε_0.
• Norm of the gradient smaller than some threshold: ||J^H r̆||_F < γ_0.
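A minimal sketch (threshold values are illustrative):

```python
# The three convergence criteria of Sect. C4 as one predicate.
import numpy as np

def converged(dz, chi2_prev, chi2, JHr,
              delta0=1e-8, eps0=1e-12, gamma0=1e-8):
    small_update = np.linalg.norm(dz) < delta0         # ||dz|| < delta_0
    small_gain = (chi2_prev - chi2) < eps0             # chi^2 improvement
    small_gradient = np.linalg.norm(JHr) < gamma0      # ||J^H r˘|| < gamma_0
    return small_update or small_gain or small_gradient
```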