Learning optimal wavelet bases using a neural network approach

Andreas Søgaard
School of Physics and Astronomy, University of Edinburgh
Email address: [email protected]

Abstract

A novel method for learning optimal, orthonormal wavelet bases for representing 1- and 2D signals, based on parallels between the wavelet transform and fully connected artificial neural networks, is described. The structural similarities between these two concepts are reviewed and combined into a "wavenet", allowing for the direct learning of optimal wavelet filter coefficients through stochastic gradient descent with back-propagation over ensembles of training inputs, where the conditions on the filter coefficients for constituting orthonormal wavelet bases are cast as quadratic regularisation terms. We describe the practical implementation of this method, and study its performance for a few toy examples. It is shown that optimal solutions are found, even in a high-dimensional search space, and the implications of the result are discussed.

Keywords: Neural networks, machine learning, optimization

1. Introduction

The Fourier transform has proved an indispensable tool within the natural sciences, allowing for the study of frequency information of functions and for the efficient representation of signals exhibiting angular structure. However, the Fourier transform is limited by being global: each frequency component carries no information about its spatial localisation; information which might be valuable. Multiresolution, and in particular wavelet, analysis has been developed, in part, to address this limitation, representing a function at various levels of resolution, or at different frequency scales, while retaining information about position-space localisation. This encoding uses the fact that, due to their smaller wavelengths, high-frequency components may be localised more precisely than their low-frequency counterparts.

The wavelet decomposition expresses any given signal in terms of a "family" of functions [2, 3], efficiently encoding frequency-position information. Several different such wavelet families exist, both for continuous and discrete input, but these are generally quite difficult to construct exactly as they don't possess closed-form representations. Furthermore, the best basis for any given problem depends on the class of signal, choosing the best among existing functional families is hard and likely sub-optimal, and constructing new bases is non-trivial, as mentioned above. Therefore, we present a practical, efficient method for directly learning the best wavelet bases, according to some optimality criterion, by exploiting the intimate relationship between neural networks and the wavelet transform.

Such a method could have potential uses in areas utilising time-series data and imaging, for instance (but not limited to) EEG, speech recognition, seismographic studies, and financial markets, as well as image compression, feature extraction, and de-noising. However, as is shown in Section 7, the areas to which such an approach can be applied are quite varied.

In Section 2 we review some of the work previously done along these lines. In Section 3 we briefly describe wavelet analyses, neural networks, as well as their structural similarity and how they can be combined. In Section 4 we discuss metrics appropriate for measuring the quality of a certain wavelet basis. In Section 5 we describe the actual algorithm for learning optimal wavelet bases. Section 6 describes the practical implementation and, finally, Section 7 provides an example use case from high-energy physics.

Note that Sections 1 and 3 contain overlaps with [1].

2. Previous work

A typical approach [4, 5, 6] when faced with the task of choosing a wavelet basis in which to represent some class of signals is to select one among an existing set of wavelet families, which is deemed suitable to the particular use case based on some measure of fitness. This might lead to sub-optimal results, as mentioned above, since limiting the search to a few dozen pre-existing wavelet families will likely result in inefficient encoding or representation of (possibly subtle) structure particular, or unique, to the problem at hand. To address this shortcoming, considerable effort has already gone into the question of the existence and construction of optimal wavelet bases.

Ref. [7] describes a method for constructing optimally matched wavelets, i.e. wavelet bases matching a prescribed pattern as closely as possible, through lifting [8]. However, the proposed method is somewhat arduous and relies on the specification of a pattern to which to match, requiring considerable and somewhat artificial preprocessing ("It is difficult to find a problem our method can be applied to without major modifications." [7, p. 125]). This is not necessarily possible, let alone easy, for many use cases as well as for the study of more general classes of inputs rather than single examples. In a similar vein, Ref. [9] provides a method for unconstrained optimisation of a wavelet basis with respect to a sparsity measure using lifting, but has the same limitations as Ref. [7].

Refs. [10, 11] provide theoretical arguments for the existence of optimal wavelet bases as well as an algorithm for constructing such a basis for single 1- or 2D inputs, based on gradient descent. However, results are only presented for low-order wavelet bases, the implementation of constraints is not discussed, and the question of generalisation from single inputs to classes of inputs is not addressed. In addition, the optimal filter coefficients referenced in [11, Table 1] do not satisfy the explicit conditions (C2), (C3), and (C4) for orthonormality in Section 3.1 below. These constraints are violated at the 1%-level, which also corresponds roughly to the relative angular deviation of the reported optimal basis from the Daubechies [12] basis of similar order.

Finally, Refs. [13, 14] provide a comprehensive prescription for designing wavelets that optimally represent signals, or classes of signals, at some fixed scale J. However, the results are quite cumbersome, are based on a number of assumptions regarding the characteristics of the input signal(s), and relate only to the question of optimal representation at fixed scales.

This indicates that, although the question of constructing optimal wavelet bases has been given substantial consideration, and clear developments have been made already, a general approach to easily learning discrete, demonstrably orthonormal wavelet bases of arbitrary structure and complexity, optimised over classes of input, has yet to be developed and implemented for a practically arbitrary choice of optimality metric. This is what is done below.

3. Theoretical concepts

In this section, we briefly review some of the underlying aspects of wavelet analysis, Section 3.1, and neural networks, Section 3.2, upon which the learning algorithm is based. In Section 3.3 we discuss the parallels between the two concepts, and how these can be used to directly learn optimal wavelet bases.

3.1. Wavelet

Numerous excellent references explain multiresolution analysis and the wavelet transform in depth, so the present text will focus on the discrete class of wavelet transforms, formulated in the language of matrix algebra as it relates directly to the task at hand. For a more complete review, see e.g. [1] or [12, 15, 16, 17, 18].

In the parlance of matrix algebra, the simplest possible input signal f ∈ R^N is a column vector

    f = ( f[0], f[1], \ldots, f[2^M - 2], f[2^M - 1] )^T    (1)

and the dyadic structure of the wavelet transform means that N must be radix 2, i.e. N = 2^M for some M ∈ N_0 (although the results below are also applicable to 2D, i.e. matrix, input, cf. Section 7). The forward wavelet transform is then performed by the iterative application of low- and high-pass filters. Let L(f) denote the low-pass filtering of input f, the i'th entry of which is then given by the convolution

    L(f)[i] = \sum_{k=0}^{2^M - 1} a[k] \, f[i + N/2 - k], \qquad i \in [0, 2^{M-1} - 1]    (2)

assuming periodicity, such that f[-1] = f[N - 1], etc.

The low-pass filter, a, is represented as a row vector of length N_filt, with N_filt even, and its entries are called the filter coefficients, {a}. The convolution yielding each entry i in L(f) can be seen as a matrix inner product of f with a row matrix of the form

    [ \cdots \; 0 \;\; a[N-1] \; \cdots \; a[1] \;\; a[0] \;\; 0 \; \cdots ]    (3)

Since this is true for each entry, the full low-pass filter may be represented as a (2^{M-1} \times 2^M) \cdot (2^M \times 1) matrix inner product:

    L(f) = L_{M-1} \, f    (4)

where, for each low-pass operation, the matrix operator is written as

   ......   . . . .       ··· a[N − 1] ··· a[1] a[0] 0 0 0 0 ···      L =  ··· 0 0 a[N − 1] ··· a[1] a[0] 0 0 ···   2m (5) m     ··· 0 0 0 0 a[N − 1] ··· a[1] a[0] ···       . . . .    ......   | {z } 2m+1

In complete analogy to Eq. (5), a high-pass filter matrix H_m can be expressed as a 2^m \times 2^{m+1} matrix parametrised in the same way by coefficients {b}, which we choose [12] to relate to {a} by

    b_k = (-1)^k \, a_{N_{\rm filt} - 1 - k} \qquad \text{for } k \in [0, N_{\rm filt} - 1]    (6)

This means that, given filter coefficients {a}, we have specified the full wavelet transform in terms of repeated application of matrix operators L_m and H_m. The filter coefficients will therefore serve as our parametrisation of any given wavelet basis.

At each step in the transform, the power of 2 that gives the current length of the (partially transformed) input, n = 2^m, is referred to as the frequency scale, m. Large frequency scales m correspond to large input arrays, which are able to encode more granular, and therefore more high-frequency, information than for small m. As the name implies, the low-pass filter acts as a spatial sub-sampling of the input from frequency scale m to m - 1, averaging out the frequency information at scale m in the process. Similarly, the high-pass filter encodes the frequency information at scale m; the information which is lost in the low-pass filtering. After each step, another pass of high- and low-pass filters is applied to the sub-sampled, low-pass filtered input. This procedure is repeated from frequency scale M down to 0. At each step, the high-pass filter encodes the frequency information specific to the current frequency scale. This is illustrated in Figure 1a.

The coefficients obtained through successive convolution of the signal with the high- and low-pass filters, i.e. the right-most layers in Figure 1a, collectively encode the same information as the position-space input f, but in the basis of wavelet functions. These are called the wavelet coefficients {c}. Given such a set of wavelet coefficients, the inverse transform can be performed by retracing the steps of the forward transform. Letting f_m denote the input signal low-pass filtered down to scale m, with f_M \equiv f, the inverse transform proceeds as

    f_0 = [c_0]    (7a)
    f_1 = L_0^T f_0 + H_0^T [c_1]    (7b)
    f_2 = L_1^T f_1 + H_1^T [c_2 \; c_3]    (7c)
    \vdots
    f \equiv f_M = L_{M-1}^T f_{M-1} + H_{M-1}^T [ c_{2^{M-1}} \cdots c_{2^M - 1} ]    (7d)

In this way it is seen that c_0 encodes the average information content in the input signal f, and that c_{i>0} dyadically encode the frequency information at larger and larger scales m. The explicit wavelet basis function corresponding to each wavelet coefficient can be found by setting c = [ \cdots \; 0 \; 1 \; 0 \; \cdots ] and studying the resulting, reconstructed position-space signal \hat{f} at some suitable largest scale M.

The filter coefficients {a} completely specify the wavelet transform and basis, but they are not completely free parameters. Instead, they must satisfy a number of explicit conditions in order to correspond to an orthonormal wavelet basis.
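As a worked illustration of the forward pass sketched in Figure 1a, the following self-contained C++ example performs the full decomposition for the Haar filter [30], with {b} obtained from {a} via Eq. (6). It is a minimal sketch under assumed conventions (periodic boundaries and a particular circular phase choice), not the implementation of Ref. [22]; the inverse of Eq. (7) simply retraces these steps with transposed operators.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// One periodic analysis step: produce the low-pass ("approximation") and
// high-pass ("detail") halves of f, with b_k = (-1)^k a_{Nfilt-1-k} (Eq. (6)).
void analysisStep(const std::vector<double>& f, const std::vector<double>& a,
                  std::vector<double>& low, std::vector<double>& high)
{
    const std::size_t N = f.size(), Nf = a.size();
    std::vector<double> b(Nf);
    for (std::size_t k = 0; k < Nf; ++k) b[k] = ((k % 2) ? -1.0 : 1.0) * a[Nf - 1 - k];
    low.assign(N / 2, 0.0);
    high.assign(N / 2, 0.0);
    for (std::size_t i = 0; i < N / 2; ++i)
        for (std::size_t k = 0; k < Nf; ++k) {
            const double x = f[(2 * i + k) % N];   // periodic boundary, cf. Eq. (2)
            low[i]  += a[k] * x;
            high[i] += b[k] * x;
        }
}

int main()
{
    // Full forward transform: repeatedly low-pass filter, storing the detail
    // coefficients at each scale, here with the Haar filter a = (1,1)/sqrt(2).
    std::vector<double> a = {1.0 / std::sqrt(2.0), 1.0 / std::sqrt(2.0)};
    std::vector<double> f = {1, 3, 2, 2, 5, 7, 6, 4}, coeffs(f.size(), 0.0);
    std::vector<double> current = f;
    while (current.size() > 1) {
        std::vector<double> low, high;
        analysisStep(current, a, low, high);
        // Detail coefficients at this scale occupy indices [n/2, n) of the output.
        for (std::size_t i = 0; i < high.size(); ++i) coeffs[high.size() + i] = high[i];
        current = low;
    }
    coeffs[0] = current[0];   // c_0: overall average information content
    for (double c : coeffs) std::printf("% .4f ", c);
    std::printf("\n");
    return 0;
}
```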


(a) Wavelet (b) Neural network (c) “Wavenet”

Figure 1: Schematic representations of the difference in architecture for (a) standard wavelet transforms, (b) fully connected neural networks, and (c) the wavelet transform formulated as a neural network, here called “wavenet”. Individual squares indicate elements in layers, i.e. entries in column vectors. Shaded areas indicate filter- or weight matrices, where red/blue represent high-/low-pass filters.

These conditions [19] are as follows. In order to satisfy the dilation equation, the filter coefficients {a} must satisfy

    \sum_k a_k = \sqrt{2}    (C1)

In order to ensure orthonormality of the scaling- and wavelet functions, the coefficients {a} and {b} must satisfy

    \sum_k a_k a_{k+2m} = \delta_{m,0} \qquad \forall\, m \in \mathbb{Z}    (C2)

and

    \sum_k b_k b_{k+2m} = \delta_{m,0} \qquad \forall\, m \in \mathbb{Z}    (C3)

where the condition for m = 0 is trivially fulfilled from (C2) through Eq. (6). To ensure that the corresponding wavelets have zero area, i.e. encode only frequency information, we require

    \sum_k b_k = 0    (C4)

Finally, to ensure mutual orthogonality of the scaling and wavelet functions, we must have

    \sum_k a_k b_{k+2m} = 0 \qquad \forall\, m \in \mathbb{Z}    (C5)

where condition (C5) is automatically satisfied through Eq. (6).

Conditions (C1–5) then collectively ensure that the filter coefficients {a} (and {b}) yield a wavelet analysis in terms of orthonormal basis functions. As we parametrise our basis uniquely in terms of filter coefficients {a}, since {b} are fixed through Eq. (6), we will need to explicitly ensure that these conditions are met. The method for doing this is described in Section 3.3.
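Conditions (C1)–(C5) can be checked numerically for any candidate filter once {b} is derived via Eq. (6). The following standalone C++ sketch (illustrative only, not part of the package of Ref. [22]) evaluates the residuals for the standard Daubechies-4 coefficients [12]; the same residuals, squared, are what enter the regularisation terms of Section 5.2.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Return v[i] if in range, otherwise zero (finite filters have compact support).
double at(const std::vector<double>& v, long i)
{
    return (i >= 0 && i < (long) v.size()) ? v[i] : 0.0;
}

int main()
{
    // Example input: the standard Daubechies-4 filter coefficients [12].
    const double s3 = std::sqrt(3.0), norm = 4.0 * std::sqrt(2.0);
    std::vector<double> a = {(1 + s3) / norm, (3 + s3) / norm,
                             (3 - s3) / norm, (1 - s3) / norm};
    const long Nf = (long) a.size();
    std::vector<double> b(Nf);
    for (long k = 0; k < Nf; ++k) b[k] = ((k % 2) ? -1.0 : 1.0) * a[Nf - 1 - k];   // Eq. (6)

    // (C1) and (C4): sums of the low- and high-pass coefficients.
    double c1 = -std::sqrt(2.0), c4 = 0.0;
    for (long k = 0; k < Nf; ++k) { c1 += a[k]; c4 += b[k]; }
    std::printf("C1 residual: %+.3e   C4 residual: %+.3e\n", c1, c4);

    // (C2), (C3), (C5): only shifts with |2m| < N_filt can give non-zero sums.
    for (long m = -(Nf / 2 - 1); m <= Nf / 2 - 1; ++m) {
        double c2 = (m == 0 ? -1.0 : 0.0), c3 = (m == 0 ? -1.0 : 0.0), c5 = 0.0;
        for (long k = 0; k < Nf; ++k) {
            c2 += a[k] * at(a, k + 2 * m);
            c3 += b[k] * at(b, k + 2 * m);
            c5 += a[k] * at(b, k + 2 * m);
        }
        std::printf("m=%+ld  C2: %+.3e  C3: %+.3e  C5: %+.3e\n", m, c2, c3, c5);
    }
    return 0;
}
```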
3.2. Neural network

Since (artificial) neural networks have become ubiquitous within most areas of the physical sciences, we will only briefly review the central concepts as they relate to the rest of this discussion. A comprehensive introduction can be found e.g. in Ref. [20].

Neural networks can be seen as general mappings f : R^n \to R^m, which can approximate any function, provided sufficient capacity. In the simplest case, such networks are constructed sequentially, where the input vector f = h_0 \in R^{N_0} is transformed through the inner product with a weight matrix \theta_1, the output of which is a hidden layer h_1 \in R^{N_1}, and so forth, until the output layer h_l \in R^{N_l} is reached. The configuration of a given neural network, in terms of number of layers and their respective sizes, is called the network architecture. In addition to the transfer matrices \theta_i, the layers may be equipped with bias nodes, providing the opportunity for an offset, as well as non-linear activation functions. A schematic representation of one such network, without bias nodes and non-linearities, is shown in Figure 1b.

The neural network can then be trained on a set of training examples, {(f_i, y_i)}, where the task of the network usually is to output a vector \hat{y}_i trying to predict y_i given f_i. The quality of the prediction is quantified by the cost or objective function J(y, \hat{y}). The central idea is then to take the error of any given prediction \hat{y}_i, given by the derivative of the cost function with respect to the prediction at the current value, and back-propagate it through the network, performing the inverse operation of the forward pass at each layer. In this way, the gradient of the cost function J with respect to each entry in the network's weight matrices (\theta_i)_{jk} is computed. Using stochastic gradient descent, for each training example one performs small update steps of the weight matrix entries along these error gradients, which is then expected to produce slightly better performance of the network with respect to the task specified by the cost function.

One challenge posed by such a fully connected network is the sheer multiplicity of weights for just a few layers of moderate sizes. Such a large number of free parameters can make the network prone to over-fitting, which can be mitigated e.g. by L2 weight regularisation, where a regularisation term R({\theta}) is added to the cost function, with a multiplier \lambda controlling the trade-off between the two contributions.

3.3. Combining concepts

The crucial step is then to recognise the deep parallels between these two constructs. We can cast the discrete wavelet transform as an R^N \to R^N neural network with a fully-connected, deep, non-sequential, dyadic architecture without bias-units and with linear (i.e. no) activations. A schematic representation of this setup, here called a "wavenet", is shown in Figure 1c. This is done by identifying the neural network transfer matrices with the low- and high-pass filter operators in the matrix formulation of the wavelet transform, cf. Eq. (5). The forward wavelet transform then corresponds to the neural network mapping, and the output vector of the neural network is exactly the wavelet coefficients of the input with respect to the basis prescribed by {a}.

If we can formulate an objective function J for the wavelet coefficients, i.e. the output of the "wavenet", this means that we can utilise the parallel with neural networks and employ back-propagation to gradually update the weight matrix entries, i.e. the filter coefficients {a}, in order to improve our wavelet basis with respect to this metric. Therefore, choosing a fixed filter length |{a}| = N_filt, and parametrising the "wavenet" in terms of {a}, we are able to directly learn the wavelet basis which is optimal according to some task J.

Interestingly, and unlike some of the approaches mentioned in Section 2, a neural network approach naturally accommodates classes of inputs, in addition to single examples. That is, one can train repeatedly on a single example and learn a basis which optimally represents this particular signal in some way, cf. e.g. [7]. However, the use of stochastic gradient descent is naturally suited for fitting the weight matrices to ensembles of training examples, which in many cases is much more meaningful and useful, cf. Section 7.

Another key observation is that while the entries in a standard neural network weight matrix are free parameters, the weights in the "wavenet" are highly constrained, since they must correspond to the low- and high-pass filters of the wavelet transform. For instance, a neural network like the one in Figure 1c, mapping R^8 \to R^8, will have 84 free parameters in the standard treatment (two 4 \times 8, two 2 \times 4, and two 1 \times 2 matrices). However, identifying each of the 6 weight matrices with the wavelet filter operators, this number is reduced to N_filt, which can be as low as 2. This is schematically shown in Figure 2. For inputs of "realistic" sizes, i.e. |f| = N \gtrsim 64, this reduction is exponentially greater, leading to a significant reduction of complexity.

Finally, we note that the filter coefficients need to conform with conditions (C1–5), cf. Section 3.1 above, in order to correspond to an orthonormal wavelet basis. This can be solved by noting that all conditions (C1–5) are differentiable with respect to {a}, which means that we can cast these conditions in the form of quadratic regularisation terms, R_i, which can then be added to the cost function with some multiplier \lambda, in analogy to standard L2 weight regularisation. The multiplier \lambda then controls the trade-off between the possibly competing objectives of optimising J and ensuring fulfillment of conditions (C1–5). In principle, this means that for finite \lambda any learned filter configuration {a} might violate these conditions to order 1/\lambda, and might therefore strictly be taken to constitute a "pseudo-orthonormal" basis. This will, however, have little impact in practical application, where one can simply choose a value of \lambda sufficiently high that O(1/\lambda) is within the tolerances of the use case at hand.

4. Measuring optimality

The choice of objective function defines the sense in which the basis learned through the method outlined in Section 3.3 will be optimal. This also affords the user a certain degree of freedom in defining the measure of optimality, the only condition being that the objective function be differentiable with respect to the wavelet coefficients {c}, possibly except for a finite number of points.

In this example we choose sparsity, i.e. the ability of a certain basis to efficiently encode the information contained in a given signal, as our measure of optimality. From the point of view of compression, sparsity is clearly a useful metric, in that it measures the amount of information that can be stored within a certain amount of space/memory. From the point of view of representation, sparsity is likely also a meaningful objective, since a basis which efficiently represents the defining features of a (class of) signal(s) will also lead the signal(s) to be sparse in this basis.

(a) Neural network   (b) Wavelet

Figure 2: Schematic representation of the entries in an 8 × 16 (a) transfer matrix in an unconstrained, fully connected neural network and (b) a corresponding filter operator in a wavelet transform with N_filt = 4 filter coefficients. Note that the entries in each row of the wavelet matrix operator are identical, and simply shifted by integer multiples of 2, cf. Eq. (2), such that the number of free parameters is only N_filt.

Based on [21], we choose the Gini coefficient G(\,\cdot\,) as our metric for the sparsity of a set of wavelet coefficients {c},

    G(\{c\}) = \frac{\sum_{i=0}^{N_c - 1} (2i - N_c - 1)\,|c_i|}{N_c \sum_{i=0}^{N_c - 1} |c_i|} \equiv \frac{f(\{c\})}{g(\{c\})}    (8)

for wavelet coefficients {c} sorted by ascending absolute value, i.e. |c_i| \leq |c_{i+1}| for all i. Here N_c \equiv |\{c\}| is the number of wavelet coefficients.

A Gini coefficient of 1 indicates a completely unequal, and therefore maximally sparse, distribution, i.e. the case in which only one coefficient has non-zero value, and therefore carries all of the information content in the signal. Conversely, a Gini coefficient of 0 indicates a completely equal distribution, i.e. each coefficient has exactly the same (absolute) value, and therefore all carry exactly the same amount of information content.
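For reference, Eq. (8) amounts to only a few lines of code. The following standalone C++ sketch (illustrative only, not taken from the package of Ref. [22]) computes G({c}) and prints it for a near-maximally sparse and a completely flat set of 64 coefficients; the quoted limiting values of 1 and 0 are approached for large N_c.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Gini-based sparsity of a set of wavelet coefficients {c}, following Eq. (8).
// The coefficients are sorted by ascending absolute value before summing.
double giniSparsity(std::vector<double> c)
{
    const double Nc = (double) c.size();
    std::sort(c.begin(), c.end(),
              [](double x, double y) { return std::fabs(x) < std::fabs(y); });
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        num += (2.0 * i - Nc - 1.0) * std::fabs(c[i]);   // f({c}) of Eq. (8)
        den += Nc * std::fabs(c[i]);                     // g({c}) of Eq. (8)
    }
    return num / den;                                    // larger means sparser
}

int main()
{
    std::vector<double> spike(64, 0.0), flat(64, 1.0);
    spike[0] = 9.0;   // a single dominant coefficient carries all information
    std::printf("G(spike) = %.3f   G(flat) = %.3f\n",
                giniSparsity(spike), giniSparsity(flat));   // ~0.95 and ~-0.03
    return 0;
}
```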

Having settled on a choice of objective function, we now proceed to describing the details of the learning procedure itself. We stress that the results of the following sections should generalise to other reasonable choices of objectives, which may be chosen based on the particular use case at hand.

5. Learning procedure

As noted above, the full objective function for the optimisation problem is given as the sum of a sparsity term S({c}) and a regularisation term R({a}), the relative contribution of the latter controlled by the regularisation constant \lambda, i.e.

    J(\{c\}, \{a\}) = S(\{c\}) + \lambda\, R(\{a\})    (9)

where {c} is the set of wavelet coefficients for a given training example and {a} is the current set of filter coefficients. The R-term ensures that the filter coefficient configuration {a} does indeed correspond to a wavelet basis as defined by conditions (C1–5) above; the S-term measures the quality of a given wavelet basis according to the chosen fitness measure. The learning task then consists of optimising the filter coefficients according to this combined objective function, i.e. finding a filter coefficient configuration, in an N_filt-dimensional parameter space, which minimises J. The procedure for computing a filter coefficient gradient for each of the two terms is outlined below.

5.1. Sparsity term

Based on the discussion in Section 4, we have chosen the Gini coefficient G(\,\cdot\,) as defined in Eq. (8) as our measure of the sparsity of any given set of wavelet coefficients {c}. The sparsity term in the objective function is chosen to be

    S(\{c\}) = 1 - G(\{c\})    (10)

This definition means that low values of S({c}) correspond to a greater degree of sparsity, such that minimising this objective function term increases the degree of sparsity.

In order to utilise stochastic gradient descent with back-propagation, the objective function needs to be differentiable in the values of the output nodes, i.e. the wavelet coefficients. Since the sparsity term is the only term which depends on the wavelet coefficients, particular care needs to be afforded here. The sparsity term is seen to be differentiable everywhere except for a finite number of points where c_i = 0. In these cases the derivative is taken to be zero, which is meaningful considering the chosen optimisation objective: coefficients of value zero will, assuming at least one non-zero coefficient exists, contribute maximally to the sparsity of the set as a whole. Therefore we don't want these coefficients to change, and the corresponding gradient should be zero. (Cases with all zero-valued coefficients are ill-defined but also practically irrelevant.)

Therefore, assuming c_i \neq 0, the derivative of the sparsity term is given by (suppressing the arguments of the objective function terms for brevity)

    \nabla_{|c|} S \equiv \hat{e}_i \frac{dS}{d|c_i|} = \hat{e}_i \frac{d}{d|c_i|}(1 - G) = -\nabla_{|c|} G = -\frac{\nabla_{|c|} f \cdot g - f \cdot \nabla_{|c|} g}{g^2}    (11)

where

    \nabla_{|c|} f = \hat{e}_i \frac{d}{d|c_i|} \left[ \sum_{k=0}^{N_c - 1} (2k - N_c - 1)\,|c_k| \right] = (2i - N_c - 1)\,\hat{e}_i    (12)

and

    \nabla_{|c|} g = \hat{e}_i \frac{d}{d|c_i|} \left[ N_c \sum_{k=0}^{N_c - 1} |c_k| \right] = N_c\,\hat{e}_i    (13)

for f and g defined in Eq. (8), where summation of vector indices is implied. To get the gradient with respect to the signed coefficient values, the gradients of f and g are multiplied by the corresponding coefficient sign, i.e.

    \nabla_c f = \mathrm{sign}(c) \times \nabla_{|c|} f    (14)

and

    \nabla_c g = \mathrm{sign}(c) \times \nabla_{|c|} g    (15)

where \times indicates element-wise multiplication. The gradients with respect to the base, non-sorted set of wavelet coefficients {c}, \nabla_c f and \nabla_c g respectively, are found by performing the inverse sorting with respect to the absolute wavelet coefficient values. In this way \nabla_c S can be computed from \nabla_c f and \nabla_c g through Eq. (11).

Having computed the gradient of the sparsity cost with respect to the output nodes (wavelet coefficients) we can now use standard back-propagation on the full network to compute the associated gradient on each entry in the low- and high-pass filter matrices. For a given, fixed filter length N_filt, entries in the filter matrices which are identically zero are not modified by a gradient. Conversely, the gradient on every filter matrix entry to which a particular filter coefficient is contributing is added to the corresponding sparsity gradient in filter coefficient space, possibly with a sign change in the case of high-pass filter matrices, cf. Eq. (6). In this way, the gradient on the wavelet coefficients is translated into a gradient in filter coefficient space, which we can then use in stochastic gradient descent, along with a similar regularisation gradient, to gradually improve our wavelet basis as parametrised by {a}.
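Read literally, Eqs. (11)–(15) translate into a short routine. The following sketch (plain C++, illustrative only and independent of the package of Ref. [22]) returns the derivative of S with respect to each unsorted, signed coefficient, with the zero-coefficient convention described above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Gradient of the sparsity term S = 1 - G (Eq. (10)) with respect to the
// unsorted, signed wavelet coefficients {c}, following Eqs. (11)-(15).
std::vector<double> sparsityGradient(const std::vector<double>& c)
{
    const std::size_t Nc = c.size();
    // Rank each coefficient by ascending absolute value (argsort).
    std::vector<std::size_t> order(Nc);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](std::size_t p, std::size_t q) {
        return std::fabs(c[p]) < std::fabs(c[q]);
    });

    // f and g of Eq. (8), evaluated on the sorted absolute values.
    double f = 0.0, g = 0.0;
    for (std::size_t i = 0; i < Nc; ++i) {
        f += (2.0 * i - Nc - 1.0) * std::fabs(c[order[i]]);
        g += double(Nc) * std::fabs(c[order[i]]);
    }

    std::vector<double> grad(Nc, 0.0);
    for (std::size_t i = 0; i < Nc; ++i) {
        const std::size_t j = order[i];          // original (unsorted) index
        if (c[j] == 0.0) continue;               // derivative taken to be zero
        const double df = 2.0 * i - Nc - 1.0;    // Eq. (12), at sorted rank i
        const double dg = double(Nc);            // Eq. (13)
        const double dG = (df * g - f * dg) / (g * g);          // dG/d|c_i|
        // Eqs. (11), (14), (15): dS/dc_j = -sign(c_j) * dG/d|c_i|.
        grad[j] = -(c[j] > 0.0 ? 1.0 : -1.0) * dG;
    }
    return grad;
}
```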

5.2. Regularisation term

The regularisation terms are included to ensure that the optimal filter coefficient configuration does indeed correspond to an orthonormal wavelet basis as defined through conditions (C1–5). As noted in Section 3.3, we choose to cast these conditions in the form of quadratic regularisation conditions on the filter coefficients {a}. Each of the conditions (C1–5) is of the form

    h_k(\{a\}) = d_k    (16)

which can be written as a quadratic regularisation term, i.e.

    R_k(\{a\}) = \left( h_k(\{a\}) - d_k \right)^2    (17)

and the combined regularisation term is then given by

    R(\{a\}) = \sum_{k=1}^{5} R_k(\{a\})    (18)

This formulation allows for the search to proceed in the full N_filt-dimensional search space, and the regularisation constant \lambda regulates the degree of precision to which the optimal filter coefficient configuration will fulfill conditions (C1–5).

In order to translate deviations from conditions (C1–5) into gradients in filter coefficient space, we take the derivative of each of the terms R_k with respect to the filter coefficients a_i. The gradients are found to be:

    \nabla_a R_1 = \hat{e}_i\, 2 \left[ \sum_k a_k - \sqrt{2} \right]    (D1)

    \nabla_a R_2 = \hat{e}_i\, 2 \sum_m \left[ \sum_k a_k a_{k+2m} - \delta_{m,0} \right] \times (a_{i+2m} + a_{i-2m})    (D2)

    \nabla_a R_3 = \hat{e}_i\, 2 \sum_m \left[ \sum_k b_k b_{k+2m} - \delta_{m,0} \right] \times (a_{i+2m} + a_{i-2m})    (D3)

    \nabla_a R_4 = \hat{e}_i\, 2 \left[ \sum_k b_k \right] \times (-1)^{N-i-1}    (D4)

    \nabla_a R_5 = 0    (D5)

Since condition (C5) is satisfied exactly by the definition in Eq. (6), the corresponding gradient is identically equal to zero. The combined gradient from the regularisation term is then the sum of the above five (four) contributions.

6. Implementation

The learning procedure based on the objective function and associated gradients presented in Section 5 is implemented [22] as a publicly available C++ [23] package. The matrix algebra operations are implemented using Armadillo [24], with optional interface to the high-energy physics ROOT library [25].

This package allows for the processing of 1- and 2-dimensional training examples of arbitrary size, provides data generators for a few toy examples, and reads CSV input as well as high-energy physics collision events in the HepMC [26] format. The 2D wavelet transform is performed by performing the 1D transform on each row in the signal, concatenating the output rows, and then performing the 1D transform on each of the resulting columns. Their matrix concatenation then corresponds to the 2D set of wavelet coefficients.

In addition to standard (batch) gradient descent, the library allows for the use of gradient momentum and simulated annealing of the regularisation term in order to ensure faster and more robust convergence to the global minimum even in the presence of local minima and steep regularisation contours.
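The row-then-column scheme for the 2D transform described above can be sketched as follows in plain C++. This is an illustration of the scheme only, under the same assumed 1D conventions as the earlier sketches, and is not the API of the package of Ref. [22].

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Plain periodic 1D multi-level forward transform (same conventions as the
// sketch in Section 3.1), used here as a building block for the 2D case.
static std::vector<double> forward1D(std::vector<double> f, const std::vector<double>& a)
{
    const std::size_t Nf = a.size();
    std::vector<double> b(Nf), out(f.size(), 0.0);
    for (std::size_t k = 0; k < Nf; ++k) b[k] = ((k % 2) ? -1.0 : 1.0) * a[Nf - 1 - k];
    while (f.size() > 1) {
        const std::size_t n = f.size();
        std::vector<double> low(n / 2, 0.0);
        for (std::size_t i = 0; i < n / 2; ++i)
            for (std::size_t k = 0; k < Nf; ++k) {
                low[i]         += a[k] * f[(2 * i + k) % n];
                out[n / 2 + i] += b[k] * f[(2 * i + k) % n];
            }
        f = low;
    }
    out[0] = f[0];
    return out;
}

// 2D transform: 1D transform of every row, then of every column of the
// row-transformed matrix, giving the 2D set of wavelet coefficients for a
// square, radix-2 input.
Matrix forward2D(const Matrix& image, const std::vector<double>& a)
{
    const std::size_t nRows = image.size(), nCols = image.front().size();
    Matrix rowPass(nRows), out(nRows, std::vector<double>(nCols, 0.0));
    for (std::size_t r = 0; r < nRows; ++r) rowPass[r] = forward1D(image[r], a);
    for (std::size_t c = 0; c < nCols; ++c) {
        std::vector<double> col(nRows);
        for (std::size_t r = 0; r < nRows; ++r) col[r] = rowPass[r][c];
        col = forward1D(col, a);
        for (std::size_t r = 0; r < nRows; ++r) out[r][c] = col[r];
    }
    return out;
}
```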
7. Example: QCD 2 → 2 processes in high-energy physics

As an example of the procedure for learning optimal wavelet bases according to the metric presented in Section 4, using the implementation in Sections 5 and 6, we choose that of hadronic jets produced at proton colliders. In particular, the input to the training is taken to be simulated quantum chromodynamics (QCD) 2 → 2 processes, generated in Pythia8 [27, 28], segmented into a 2D array of size 64 × 64 in the \eta-\phi plane, roughly corresponding to the angular granularity of present-day general-purpose particle detectors. The collision events are generated at a centre-of-mass energy of \sqrt{s} = 13 TeV with a generator-level p_\perp cut of 280 GeV imposed on the leading parton.

QCD radiation patterns are governed by scale-independent splitting kernels [29], which could make them suitable candidates for wavelet representation, since these naturally exhibit self-similar, scale-independent behaviour. In that case, the optimal (in the sense of Section 4) representation is one which efficiently encodes the localised angular structure of this type of process, and could be used to study, or even learn, such radiation patterns. In addition, differences in representation might help distinguish between such non-resonant, one-prong "QCD jets" and resonant, two-prong jets e.g. from the hadronic decay of the W and Z electroweak bosons.

We also note that, as alluded to in Section 3.3, for signals of interest in collider physics, a standard neural network with "wavenet" architecture contains an enormous number of free parameters, e.g. N_c \approx 4.4 \times 10^7 for N \times N = 64 \times 64 input, which is reduced to N_filt, i.e. as few as two, by the parametrisation in terms of the filter coefficients {a}.

We apply the learning procedure using Ref. [22], iterating over such "dijet" events pixelised in the \eta-\phi plane, and use back-propagation with gradient descent to learn the configuration of {a} which, for fixed N_filt, minimises the combined sparsity and regularisation in Eq. (9). This is shown in Fig. 3 for N_filt = 2.

It is seen that, for N_filt = 2, only one minimum exists, due to only one point in a_1-a_2 space fulfilling all five conditions (C1–5). This configuration has a_1 = a_2 = 1/\sqrt{2} and is exactly the Haar wavelet [30]. Although this is an instructive example allowing for clean visualisation, showing the clear effect of the gradient descent algorithm and the efficacy of the interpretation of conditions (C1–5) as quadratic regularisation terms, it also doesn't tell us much, since the global minimum will be the same for all classes of inputs. For N_filt > 2 the regularisation allows for minima in an effective hyperspace with dimension D > 0.
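The N_filt = 2 case can be reproduced qualitatively with a self-contained toy: minimise the combined objective of Eq. (9) over (a_1, a_2), starting from a point on the unit circle. The sketch below is illustrative only; it uses a short 1D toy signal rather than dijet images, finite-difference gradients rather than the back-propagation of Section 5, and arbitrary values for \lambda and the step size. Under these assumptions it is expected to settle near the Haar configuration a_1 = a_2 = 1/\sqrt{2}.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Multi-level periodic forward transform for N_filt = 2 (same conventions as
// the earlier sketches), returning the wavelet coefficients {c}.
static std::vector<double> wavelet(std::vector<double> f, const std::vector<double>& a)
{
    std::vector<double> b = {a[1], -a[0]};                     // Eq. (6) for N_filt = 2
    std::vector<double> c(f.size(), 0.0);
    while (f.size() > 1) {
        const std::size_t n = f.size();
        std::vector<double> low(n / 2, 0.0);
        for (std::size_t i = 0; i < n / 2; ++i)
            for (std::size_t k = 0; k < 2; ++k) {
                low[i]       += a[k] * f[(2 * i + k) % n];
                c[n / 2 + i] += b[k] * f[(2 * i + k) % n];
            }
        f = low;
    }
    c[0] = f[0];
    return c;
}

// Toy objective J = S + lambda * R of Eq. (9), with S from Eqs. (8), (10) and
// the regularisation terms written out explicitly for N_filt = 2.
static double objective(const std::vector<double>& a, const std::vector<double>& signal,
                        double lambda)
{
    std::vector<double> mag;
    for (double x : wavelet(signal, a)) mag.push_back(std::fabs(x));
    std::sort(mag.begin(), mag.end());
    const double Nc = (double) mag.size();
    double f = 0.0, g = 0.0;
    for (std::size_t i = 0; i < mag.size(); ++i) {
        f += (2.0 * i - Nc - 1.0) * mag[i];
        g += Nc * mag[i];
    }
    const double S  = 1.0 - f / g;                              // Eq. (10)
    const double r1 = a[0] + a[1] - std::sqrt(2.0);             // (C1)
    const double r2 = a[0] * a[0] + a[1] * a[1] - 1.0;          // (C2), m = 0 only
    const double r3 = r2;                                       // (C3) coincides with (C2) here
    const double r4 = a[1] - a[0];                              // (C4): b_0 + b_1
    return S + lambda * (r1 * r1 + r2 * r2 + r3 * r3 + r4 * r4);
}

int main()
{
    const std::vector<double> signal = {1, 3, 2, 2, 5, 7, 6, 4};
    const double lambda = 10.0, rate = 0.01, eps = 1e-5;        // illustrative settings
    std::vector<double> a = {std::cos(0.2), std::sin(0.2)};     // start on the unit circle
    for (int step = 0; step < 5000; ++step) {
        double grad[2];
        for (int j = 0; j < 2; ++j) {                           // finite-difference gradient
            std::vector<double> up = a, dn = a;
            up[j] += eps; dn[j] -= eps;
            grad[j] = (objective(up, signal, lambda) - objective(dn, signal, lambda)) / (2 * eps);
        }
        for (int j = 0; j < 2; ++j) a[j] -= rate * grad[j];     // gradient-descent update
    }
    // Expected to end near the Haar configuration (0.7071, 0.7071).
    std::printf("a = (%.4f, %.4f)\n", a[0], a[1]);
    return 0;
}
```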

Instead choosing N_filt = 16 we can perform the same optimisation, but now with sufficient capacity of the wavelet basis to encode the defining features of this class of signals. The effect of the learning procedure is presented in Figure 4, showing a selection of the lowest-scale wavelet basis functions corresponding to particular filter coefficient configurations at the beginning of, during, and at convergence of the learning procedure in this higher-dimensional search space.

The random initialisation on the unit hyper-sphere is shown to produce random noise (Figure 4a), which does not correspond to a wavelet basis, since the algorithm has not yet been afforded time to update the filter coefficients to conform with the regularisation requirements. At some point roughly half way through the training, the filter coefficient configuration does indeed yield an orthonormal wavelet basis (Figure 4b), and the learning procedure now follows the gradients towards greater sparsity along a high-dimensional, quadratic regularisation "valley". Finally, at convergence, the optimal wavelet found is again seen to be exactly the Haar wavelet (Figure 4c), despite the vast amount of freedom provided to the algorithm by virtue of 16 filter coefficients. That is, the learning procedure arrives at the optimal configuration by setting 14 filter coefficients to exactly zero without any manual tuning.

This result shows that limiting the support of the basis functions provides for more efficient representation than any deviations due to radiation patterns could compensate for. Indeed, it can be shown that, removing some of the conditions (C1–5) so as to ensure that {a} simply corresponds to an orthonormal basis (i.e. not necessarily an orthonormal wavelet basis), the learning procedure results in the pixel basis, i.e. the one in which each basis function corresponds to a single entry in the input array. This shows that, due to the fact that QCD showers are fundamentally point-like (due to the constituent particles) and since they, to leading order, are dominated by a few particles carrying the majority of the energy in the jet, the representation which best allows for the representation of single particles will prove optimal according to our chosen measure, Eq. (8). However, since this example studies the optimal representation of entire events, its conclusions may change for inputs restricted to a certain region in \eta-\phi space around a particular jet, i.e. for the study of optimal representation of jets themselves.

Figure 3: Map of the average total cost (regularisation and sparsity) for QCD 2 → 2 events with \hat{p}_\perp > 280 GeV, for only two filter coefficients a_{1,2}. Initial configurations are generated on the unit circle in the a_1-a_2 plane (red dots on dashed red line), to initially satisfy condition (C2), and better configurations are then learned iteratively (solid black lines) by using back-propagation with gradient descent, until a minimum (blue dot(s)) is found.

(a) Initial configuration  (b) Intermediate configuration  (c) Final configuration

Figure 4: Examples of the 64 lowest-scale 2D wavelet basis functions, found by optimisation on rasterised QCD 2 → 2 events with \hat{p}_\perp > 280 GeV in an N_filt = 16-dimensional filter coefficient space, (a) at initialisation, (b) at an intermediate point during training and (c) at termination of the learning procedure upon convergence.

Acknowledgments

The author is supported by the Scottish Universities Physics Alliance (SUPA) Prize Studentship. The author would like to thank Troels C. Petersen for insightful discussions on the subject matter, and James W. Monk for providing Monte Carlo samples.

References

[1] A. Søgaard, Boosted bosons and wavelets, M.Sc. thesis, University of Copenhagen (August 2015).
[2] J. Morlet et al., Wave propagation and sampling theory, Part I: Complex signal and scattering in multilayer media, J. Geophys. 47 (1982) 203–221.
[3] J. Morlet et al., Wave propagation and sampling theory, Part II: Sampling theory and complex waves, J. Geophys. 47 (1982) 222–236.
[4] A. Mojsilović, M. V. Popović, and D. M. Rackov, On the Selection of an Optimal Wavelet Basis for Texture Characterization, in: IEEE Transactions on Image Processing, Vol. 4, 2000.
[5] H. Qureshi, R. Wilson, and N. Rajpoot, Optimal Wavelet Basis for Wavelet Packets based Meningioma Subtype Classification, in: 12th Medical Image Understanding and Analysis (MIUA 2008), 2008.
[6] O. Pont, A. Turiel, and C. J. Pérez-Vicente, On optimal wavelet bases for the realization of microcanonical cascade processes, International Journal of Wavelets, Multiresolution and Information Processing 9 (1) (2011) 35–61.
[7] H. Thielemann, Optimally matched wavelets, Ph.D. thesis, Universität Bremen (March 2006).
[8] W. Sweldens, The Lifting Scheme: A Construction of Second Generation Wavelets, Journal on Mathematical Analysis 29 (2) (1997) 511–546.
[9] N. P. Hurley et al., Maximizing sparsity of wavelet representations via parameterized lifting, in: 15th International Conference on Digital Signal Processing, 2007, pp. 631–634.
[10] Y. Zhuang and J. S. Barras, Optimal wavelet basis selection for signal representation, in: Proc. SPIE, Vol. 2242 of Wavelet Applications, 1994, pp. 200–211.
[11] Y. Zhuang and J. S. Barras, Constructing optimal wavelet basis for image compression, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 4, 1996, pp. 2351–2354.
[12] I. Daubechies, Ten Lectures on Wavelets, CBMS-NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics (SIAM) (1992).
[13] A. H. Tewfik, D. Sinha, and P., On the Optimal Choice of a Wavelet for Signal Representation, in: IEEE Transactions on Information Theory, Vol. 38, 1992.
[14] R. A. Gopinath, J. E. Odegard, and C. S. Burrus, Optimal wavelet representation of signals and the wavelet sampling theorem, in: IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 41, 1994.
[15] S. G. Mallat, A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (7) (1989) 674–693.
[16] Y. Meyer, Wavelets and Operators, Cambridge University Press, 1992.
[17] A. Jensen and A. la Cour-Harbo, Ripples in Mathematics: the Discrete Wavelet Transform, Springer, 2001.
[18] J. Williams and K. Amaratunga, Introduction to Wavelets in Engineering, Int. Journ. Num. Meth. Eng. 37 (14) (1994) 2365–2388.
[19] D. S. G. Pollock, The Framework of a Dyadic Wavelets Analysis, retrieved from http://www.le.ac.uk/users/dsgp1/SIGNALS/MoreWave.pdf on September 3, 2018.
[20] C. M. Bishop, Neural networks for pattern recognition, Clarendon Press, 1995.
[21] N. P. Hurley and S. T. Rickard, Comparing Measures of Sparsity, IEEE Transactions on Information Theory 55 (10) (2009) 4723–4741.
[22] A. Søgaard, "Wavenet" package, retrieved from www.github.com/asogaard/Wavenet on September 3, 2018 (2017).
[23] B. Stroustrup, The C++ Programming Language, Pearson Education India, 1995.
[24] C. Sanderson and R. Curtin, Armadillo: a template-based C++ library for linear algebra, Journal of Open Source Software 1 (26).
[25] R. Brun and F. Rademakers, ROOT - An Object Oriented Data Analysis Framework, Nucl. Inst. & Meth. in Phys. Res. A 389 (1997) 81–86, see also http://root.cern.ch/.
[26] M. Dobbs and J. B. Hansen, The HepMC C++ Monte Carlo Event Record for High Energy Physics, Comput. Phys. Commun. 134 (41).
[27] T. Sjöstrand, S. Mrenna, and P. Skands, A Brief Introduction to PYTHIA 8.1, JHEP 05 (026).
[28] T. Sjöstrand, S. Mrenna, and P. Skands, A Brief Introduction to PYTHIA 8.1, Comput. Phys. Comm. 178 (852), arXiv:0710.3820.
[29] A. Buckley et al., General-purpose event generators for LHC physics, Phys. Rept. 504 145–233.
[30] A. Haar, Zur Theorie der orthogonalen Funktionensysteme, Mathematische Annalen 69 (3) (1910) 331–371.
