A New Algorithm for Non-Negative Sparse Approximation


Nicholas Schachter

July 2, 2020

Abstract

In this article we introduce a new algorithm for non-negative sparse approximation problems based on a combination of the approaches used in orthogonal matching pursuit and basis pursuit de-noising for solving sparse approximation problems. By taking advantage of structural properties inherent to non-negative sparse approximation problems, a branch and bound (BnB) scheme is developed that enables fast and accurate recovery of underlying dictionary atoms, even in the presence of noise. Detailed analysis of the performance of the algorithm is discussed, with attention specifically paid to situations in which the algorithm will perform better or worse based on the properties of the dictionary and the required sparsity of the solution. Performance on test sets is presented along with possible directions for future research and improvements.

1 Introduction

Non-negative sparse approximation (NNSA) is a special case of the sparse approximation (SA) problem. In SA we are given a dictionary D ∈ R^(m×n), where m < n, and a signal vector y ∈ R^m, and asked to find x ∈ R^n such that ||y − Dx||_p, for a given choice of p-norm, and ||x||_0 are minimized. In NNSA we add the constraints x ≥ 0 element-wise and y − Dx ≥ 0¹. In spectroscopy and applied chemistry, this problem is sometimes called mixture analysis, as it is commonly used to analyze unknown chemical mixtures.

Unfortunately, optimization problems involving the 0-norm are known to be NP-Hard in many cases [6], so we must make do with methods of approximation [4]. Commonly chosen methods for this include ℓ1 regularization (also known as the LASSO) [15], elastic net regularization (which includes ℓ1 and ℓ2 regularization as special cases) [19], matching pursuit and its extensions [12], and proximal gradient methods [5].

When D satisfies certain conditions relating to the spark², it can be shown that the convex relaxation of the 0-norm formulation is guaranteed to find the optimal solution to the problem [8], [16]. In situations where these conditions are not met, these methods can often result in solutions that contain noticeable residual components. When it is suspected that some of the components of the measured signal are small relative to the others, determining whether these residual signals are necessary or extraneous becomes non-trivial.

This challenge is especially acute for NNSA instances where it is likely that a substance of interest makes up a very small proportion of the measured signal in total. Methods that are based on minimizing the residual norm will naturally be more prone to missing these smaller components. Additionally, when the measured signal contains noise there are further complications.

¹ This constraint prevents overexplaining of the signal vector and is not strictly necessary, but is included here because it both empirically improves results and simplifies some of the analysis.
² The smallest integer k such that there exists a set of k columns of D that are linearly dependent.

In most (if not all) real-life scenarios there will be some degree of imprecision, or noise, present in the measured signal. If the noise present in the measured signal is sufficiently large, methods based on a convex relaxation of the ℓ0-norm constraint will lead to extraneous coefficients being non-zero. Empirically, the signal/noise ratio does not have to be particularly poor in order for this effect to occur; in one of the tests detailed in section 4 the signal/noise ratio is roughly 10 to 1 and this effect can be observed. Fundamentally, this is because we are not actually solving argmin_x ||y − Dx||_p s.t. x ≥ 0 but rather argmin_x ||ỹ − Dx||_p s.t. x ≥ 0, where ỹ is our inexact measurement of the pure signal.

As the goal of NNSA is to estimate the relative amounts of dictionary atoms present in a signal, the ideal point of comparison is the pure signal, not the measured one. Obviously in real life we cannot actually observe the pure signal, but with some assumptions we can reconstruct a very good post-hoc approximation of it. The primary assumption necessary is that the dictionary D contains nearly noise-free representations of its atoms. This is a reasonable assumption to make when either the dictionary has been compiled on a more precise instrument than the tool being used to take measurements of the target spectra, or the dictionary atoms are the results of numerous measurements averaged out in order to reduce noise, as is often the case in applications of hand-held spectroscopic tools.

The second major assumption is that the noise present in the measurement is randomly distributed (ideally with a mean of 0, but this is not strictly necessary). If the noise is randomly distributed, its effects will, on average, be distributed globally across the measured signal, and thus the improvement in fit gained by accounting for the noise in one area will be outweighed by the loss introduced elsewhere as a result. When combined with an explicit sparsity constraint, which ensures that extraneous components are not added to the estimate of the pure signal in an attempt to account for noise, we can generate a close approximation of the pure signal by finding the subset of atoms in the dictionary that best fits the measured signal. This is the fundamental underpinning of the algorithm presented in this article.

In section 2 we describe the algorithm, prove that it always results in an optimal solution to the NNSA problem, and examine its asymptotic time complexity. Section 3 contains theoretical analysis of the conditions necessary for the algorithm to perform optimally in terms of speed, as well as detailed exposition on the relevance of the composition of the dictionary atoms when performing sparse approximation in general. Section 4 presents the results of computational tests done using this algorithm with a simulated dictionary as compared to using LASSO, with specific emphasis on the relative performances in correctly identifying and weighting smaller components in the measured signal. Section 5 summarizes the conclusions of this article and describes possible areas of improvement and future research.

2 The Algorithm

Given a dictionary D ∈ R^(m×n), a measured vector y ∈ R^m, a sparsity parameter k, and a noise estimate ε, the algorithm works as follows.

1. Set j = 0 and set the minimum ℓ1 residual to (||y||_1 − ε)/(k − j).

2. For all atoms in D, minimize ||D_i x − y||_1 subject to x ≥ 0 and y ≥ D_i x element-wise. Store the indices of all atoms such that ||y||_1 − ||D_i x − y||_1 ≥ (||y||_1 − ε)/(k − j).

3. Set j = j + 1.

4. If k > j, set the minimum ℓ1 residual to (||y||_1 − ε)/(k − j); otherwise return the best value of x and the corresponding indices.

5. For each stored index (or combination of indices), iterate over all other atoms in D and minimize ||D_S x − y||_1 subject to x ≥ 0 and y ≥ D_S x element-wise, where S is the set of atoms being examined. Store all combinations of indices such that ||y||_1 − ||D_S x − y||_1 ≥ (||y||_1 − ε)/(k − j).

6. If k > j, go to step 3; otherwise return the best value of x and the corresponding indices.

Algorithm 1 BnB Algorithm for NNSA

Precondition: D is an m by n dictionary with m < n, ε is an estimate of the ℓ1-norm of the noise in y

1: function SparseApprox(D, k, y, ε)
2:   λ ← (||y||_1 − ε)/k
3:   S_1, γ, ρ ← ∅
4:   τ ← ∞
5:   for i ← 1 to n do
6:     x ← argmin_x ||D_i x − y||_1 s.t. x ≥ 0 & y ≥ D_i x element-wise
7:     if ||y||_1 − ||D_i x − y||_1 ≥ λ then
8:       S_1 ← [S_1 ; i]    ▷ [x ; y]: concatenation of x and y
9:       if ||D_i x − y||_1 < τ then
10:        τ ← ||D_i x − y||_1
11:        γ ← x
12:        ρ ← i
13: for i ← 2 to k do
14:   S_i ← ∅
15:   λ ← (||y||_1 − ε)/(k − i − 1)
16:   for j ← 1 to |S_{i−1}| do
17:     for q ← 1 to n do
18:       s ← [S_{i−1}[j] ; q]    ▷ The elements of S_i are sets of indices of size i
19:       x ← argmin_x ||D_s x − y||_1 s.t. x ≥ 0 & y ≥ D_s x element-wise
20:       if ||y||_1 − ||D_s x − y||_1 ≥ λ then
21:         S_i ← [S_i ; s]
22:         if ||D_s x − y||_1 < τ then
23:           τ ← ||D_s x − y||_1
24:           γ ← x
25:           ρ ← s
26: return τ, γ, ρ

This algorithm can be seen as an extension of orthogonal matching pursuit, which iterates k times over the atoms in the dictionary and greedily tracks the subset of atoms that results in the best fit to the observed signal [17]. By taking advantage of the non-negativity of the data being analyzed, we can use the properties of the ℓ1 norm to establish an upper bound on goodness of fit for each atom (and combination of atoms) as the algorithm progresses, as well as a minimum goodness-of-fit bound at each stage, removing candidate solutions whose upper bound does not satisfy the lower bound for an optimal solution. In other words, a breadth-first branch and bound scheme.
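As a concrete illustration, the following is a minimal Julia sketch of this breadth-first branch and bound search. It is not the author's implementation: the inner ℓ1 fit is approximated by least squares followed by clipping to the non-negative orthant (as suggested for single atoms in section 2.2), the bound schedule is simplified relative to Algorithm 1, and the names `sparse_approx` and `fit_subset` are purely illustrative.

```julia
# Hedged sketch of the breadth-first BnB search; not the paper's reference code.
using LinearAlgebra

# Crude inner solver: least squares on the selected atoms, clipped to x ≥ 0.
function fit_subset(D, y, s)
    x = max.(D[:, s] \ y, 0.0)
    r = norm(D[:, s] * x - y, 1)        # ℓ1 residual of this candidate subset
    return x, r
end

function sparse_approx(D, y, k, ϵ)
    n = size(D, 2)
    best_r, best_x, best_s = Inf, Float64[], Int[]
    λ = (norm(y, 1) - ϵ) / k            # minimum amount of signal an atom must explain
    level = Vector{Vector{Int}}()       # surviving index sets at the current depth
    for i in 1:n
        x, r = fit_subset(D, y, [i])
        norm(y, 1) - r >= λ && push!(level, [i])
        if r < best_r
            best_r, best_x, best_s = r, x, [i]
        end
    end
    for depth in 2:k
        λ = (norm(y, 1) - ϵ) / (k - depth + 1)   # simplified bound update
        next_level = Vector{Vector{Int}}()
        for s in level, q in 1:n
            q in s && continue
            cand = vcat(s, q)
            x, r = fit_subset(D, y, cand)
            norm(y, 1) - r >= λ && push!(next_level, cand)
            if r < best_r
                best_r, best_x, best_s = r, x, cand
            end
        end
        level = next_level
    end
    return best_r, best_x, best_s
end
```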

2.1 Proof Of Optimality

First, we state the theorem explicitly.

Theorem 2.1. Algorithm 1 is guaranteed to return an optimal solution to the NNSA problem. To prove this, we will need a simple lemma.

Lemma 2.2. For vectors with non-negative entries, the ℓ1 norm satisfies the triangle inequality with equality: ||x||_1 + ||y||_1 = ||x + y||_1.

Proof. This follows from the fact that ||x||_1 = Σ_{i=1}^{n} |x_i|, which is equivalent to Σ_{i=1}^{n} x_i when x contains only non-negative values. Therefore, given two vectors x and y containing only non-negative values, ||x||_1 + ||y||_1 = Σ_{i=1}^{n} |x_i| + Σ_{i=1}^{n} |y_i| = Σ_{i=1}^{n} x_i + Σ_{i=1}^{n} y_i = x_1 + y_1 + ... + x_n + y_n = Σ_{i=1}^{n} (x_i + y_i) = Σ_{i=1}^{n} |x_i + y_i| = ||x + y||_1.

We are now ready to prove theorem 2.1. Assume without loss of generality that ||x||_1 = 1.³ It follows that if α_1 x_1 + ... + α_k x_k = x, then max(||α_i x_i − x||_1) ≤ 1 − 1/k. To see this, note that if max(||α_i x_i − x||_1) > 1 − 1/k then by lemma 2.2 we would have ||Σ_{i=1}^{k} α_i x_i||_1 = ||x||_1 < 1, which is a contradiction. This also shows that if we select the largest coefficients from the set of α_i along with their corresponding vectors x_i, then ||α_S x_S − x||_1 ≤ 1 − 1/(k − |S| − 1), where S is the selected subset.

From the above, we know that the atom of D such that r = ||D_i x − y||_1 is minimized will satisfy

r ≤ ||y||_1 (1 − 1/k)    (1)

Therefore, as the algorithm finds and tracks all atoms of D satisfying that criterion, it cannot miss said atom. Furthermore, because the algorithm tracks all sets of atoms S with sizes up to k satisfying the requirement

max(||D_S x − y||_1) ≤ ||y||_1 (1 − 1/(k − |S| − 1))    (2)

it cannot miss the set of atoms of size k that minimizes ||D_S x − y||_1. This proves the algorithm is guaranteed to find the optimal solution.
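For intuition, the exact additivity in Lemma 2.2 is easy to check numerically; the short Julia snippet below uses arbitrary example vectors (chosen to be exactly representable in floating point).

```julia
# Quick numerical check of Lemma 2.2: for non-negative vectors the ℓ1 triangle
# inequality holds with equality. The vectors below are arbitrary examples.
using LinearAlgebra

x = [0.25, 0.0, 1.5]
y = [0.5, 0.75, 0.0]
println(norm(x, 1) + norm(y, 1) == norm(x + y, 1))   # prints: true
```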

2.2 Algorithm Time Complexity

To determine the time complexity of the algorithm, we examine each step in sequence. Step 1 requires O(m) time to compute the ℓ1 norm of y. Step 2 requires O(nm³) time, as solving argmin_x ||D_i x − y||_1 s.t. x ≥ 0 and y ≥ D_i x element-wise can be done via linear least squares followed by rescaling to meet the constraints, which requires O(m³) time when only a single atom of D is being examined.

Steps 3 and 4 have unit cost. Step 5 has cost O(ζnL), where ζ is the number of stored combinations of indices of size j and L is the time complexity of finding the value x that minimizes ||D_s x − y||_1 subject to x ≥ 0 and y ≥ D_s x. Solving argmin_x ||D_s x − y||_1 s.t. x ≥ 0 and y ≥ D_s x element-wise is equivalent to performing constrained least absolute deviation regression (LADR) on a selected subset s of D with respect to the signal vector y. LADR can be solved via linear programming (LP), and all of the constraints in this case are linear, so L is clearly polynomial. Steps 3 through 6 are repeated a further k − 2 times, giving a total cost of O(kζL + nm³).

In the worst case, every combination of atoms in D will satisfy the ℓ1 residual constraints, which gives the algorithm a time complexity of O(C(n, k)L + nm³), where C(n, k) denotes the binomial coefficient. For a fixed value of k not dependent on n, O(C(n, k)) = O(n^k), giving

O(n^k L + nm³)    (3)

³ We can rescale any vector without this property by dividing by ||x||_1.

When taken as a function of n and k, this is not a polynomial running-time bound. It is not even fixed parameter tractable (FPT), because the definition of FPT excludes functions of the form f(n, k) = n^k. However, it does belong to the complexity class XP_uniform, which contains all problems solvable via a uniform⁴ algorithm in time O(n^f(k)) [9]. XP_uniform contains many problems that are fundamentally intractable⁵, but because in this case the function f(k) is linear, this algorithm is feasible when k is small. However, for values of k not substantially smaller than ⌊n/2⌋, where C(n, k) achieves its maximum value, this algorithm can be very slow.

2.3 Minimizing the ℓ1-norm Residual

As noted in the previous section, solving argmin_x ||D_s x − y||_1 s.t. x ≥ 0 and y ≥ D_s x element-wise can be done via linear programming. That said, it is inefficient to repeatedly call an LP solver in code when a more specific method can be used. By taking advantage of how the constraints restrict the set of feasible solutions, we can formulate a gradient descent method.

The absolute value function is not differentiable over R, specifically at 0, so in general gradient descent methods are not applicable to ℓ1-norm minimization problems. However, the constraint that y ≥ D_s x element-wise allows us to look only at the non-negative orthant R_{0+}, over which the absolute value function is differentiable. Using this, we use a Lagrangian relaxation approach to re-frame the optimization function as

F(x, λ) = ||D_s x − y||_1 + λ_1 sum(−log(x)) + λ_2 sum(−log(y − D_s x))   if x ≥ 0 and y ≥ D_s x
F(x, λ) = ∞                                                               if x < 0
F(x, λ) = ∞                                                               if y < D_s x
(4)

where D_s is the subset of the dictionary being examined, λ_1 is the Lagrange multiplier for the non-negativity constraint, and λ_2 is the Lagrange multiplier for the element-wise constraint y ≥ D_s x. The gradient of F(x, λ) is, with some abuse of notation,

∇F(x, λ) = D_s′ sgn(D_s x − y) − λ_1 Σ_{i=1}^{|s|} 1/x[i] + λ_2 Σ_{i=1}^{n} D_s[i, :]/(y[i] − D_s[i, :]·x)   if x ≠ 0 and y ≠ D_s x
∇F(x, λ) = D_s′ sgn(D_s x − y) + λ_2 Σ_{i=1}^{n} D_s[i, :]/(y[i] − D_s[i, :]·x)                              if x = 0 and y ≠ D_s x
∇F(x, λ) = −λ_1 Σ_{i=1}^{|s|} 1/x[i]                                                                         if x ≠ 0 and y = D_s x
∇F(x, λ) = 0                                                                                                 if x = 0 and y = D_s x
(5)

The abuse of notation is that the piece-wise constraints for the gradient change the formula per element of x, whereas if any of the elements of x fail to satisfy the constraints in equation 4 the entire function evaluates to ∞. Therefore, if x = [1, 0] and y > D_s x the gradient will be

[ D_s[:, 1] · sgn(D_s[:, 1] − y) − λ_1 + λ_2 Σ_i D_s[i, 1]/(y[i] − D_s[i, 1]) ,  D_s[:, 2] · sgn(−y) + λ_2 Σ_i D_s[i, 2]/y[i] ].

Using the formulas in equations 4 and 5, we can easily devise a gradient descent algorithm.
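To make equations 4 and 5 concrete, here is a minimal Julia sketch of the barrier objective and a simplified, per-element form of its gradient on the strictly feasible region (x > 0 and y > D_s x element-wise). The function names are illustrative and the λ_1 term is written per element rather than summed, so this should be read as a hedged sketch rather than a transcription of equation 5.

```julia
# Hedged sketch of equation 4 and a simplified gradient; names are illustrative.
using LinearAlgebra

function barrier_objective(Ds, y, x, λ1, λ2)
    slack = y - Ds * x
    (any(x .< 0) || any(slack .< 0)) && return Inf       # infeasible points evaluate to ∞
    return norm(Ds * x - y, 1) - λ1 * sum(log.(x)) - λ2 * sum(log.(slack))
end

function barrier_gradient(Ds, y, x, λ1, λ2)
    slack = y - Ds * x
    # subgradient of ||Ds*x − y||_1 plus the two log-barrier terms
    return Ds' * sign.(Ds * x - y) .- λ1 ./ x .+ λ2 .* (Ds' * (1 ./ slack))
end
```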

The function in equation 4 is convex⁶ but not strongly convex when x is a feasible solution to the original optimization problem, so we get a convergence rate of O(1/ε), assuming a small enough step-size parameter, where ε is the desired accuracy of the solution. All function evaluations within this algorithm take O(m²) at worst, so the running time of algorithm 1 when utilizing algorithm 2 is

O(n^k m² ε⁻¹ + nm³)    (6)

⁴ The same algorithm is used for all values of k, though the algorithm can accept k as a parameter.
⁵ Any problem with an algorithm having time complexity of the form O(n^(2^(2^k))) is in XP_uniform.
⁶ The ℓ1 norm and the negative logarithm are both convex functions, and convex functions are closed under addition.

Algorithm 2 Gradient Descent for ℓ1 residual minimization

Precondition: D_s is an m by |s| subset of a dictionary D with |s| < n, λ is a pair of Lagrange multipliers, α is a step-size parameter, δ is a step-size adjustment factor, and ε is a stopping criterion

1: function GradientDescent(D_s, y, λ, α, δ, ε)
2:   x ← 0_{|s|}    ▷ 0_{|s|}: 0 vector of length |s|
3:   g ← 0_{|s|}
4:   τ ← ∞
5:   while τ > ε do
6:     g_prev ← g
7:     g ← ∇F(x, λ)
8:     x ← x − gα
9:     τ ← ||g||_2
10:    if |τ − ||g_prev||_2| < α then
11:      α ← δα
12:  return x

Empirical testing indicates that good starting values for λ, α, δ, and ε are [0.1, 0.1], 0.125, 0.8, and 0.001 respectively. In an actual implementation of algorithm 2 it is advised to impose a maximum number of iterations, as the accuracy vs. speed trade-off is questionable near the minimum.
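The loop itself is short; the following Julia sketch mirrors Algorithm 2 with the step-size shrinking rule and an iteration cap, taking the gradient as a closure (for instance the barrier_gradient sketch above). The defaults follow the empirically suggested values; when the gradient includes the log-barrier terms, a strictly positive starting point may be preferable to the zero vector used in the listing.

```julia
# Hedged sketch of Algorithm 2; ∇F is any closure returning the gradient of equation 4.
using LinearAlgebra

function gradient_descent(∇F, x0; α = 0.125, δ = 0.8, ϵ = 0.001, maxiter = 10_000)
    x = copy(x0)                           # Algorithm 2 starts from the zero vector
    g_prev = zero(x)
    τ = Inf
    iters = 0
    while τ > ϵ && iters < maxiter         # iteration cap, as advised in the text
        g = ∇F(x)
        x = x .- α .* g                    # gradient step
        τ = norm(g, 2)
        if abs(τ - norm(g_prev, 2)) < α    # progress stalled relative to the step size
            α = δ * α                      # shrink the step
        end
        g_prev = g
        iters += 1
    end
    return x
end
```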

3 Performance Analysis

While the asymptotic performance of algorithm 1 is exponential in terms of n and k, the performance on specific problem instances is more nuanced. We shall make two basic assumptions about the dictionary: first, it does not contain the 0 vector, and second, all atoms are pairwise independent. The 0 vector is not relevant in NNSA contexts, as any dictionary atom generating no signal cannot be distinguished in the first place, and dictionary atoms that are rescalings of one another cannot be quantitatively distinguished, so we rule out these two cases as irrelevant.

3.1 Combinatorial Bounds

The optimal scenario is that the k atoms from the dictionary that are present in the measured signal are orthogonal to all other atoms in the dictionary⁷, and that the coefficients are such that only one atom will be considered viable at each step. In this scenario, the algorithm will perform a pass over all n atoms of the dictionary to determine the optimal first component, then use the bounds determined in that step to rule out all but one component in each of the k − 1 subsequent passes, requiring only nm³ + (k − 1)m²ε⁻¹ operations.

This scenario is actually realized when, assuming the atoms comprising y form an orthogonal subset of D, the i-th atom, in descending order of ℓ1 norm explained, explains (2/3)(||y_{i−1}||_1/||y||_1)⁸ of the signal, except for the final atom, which explains all of the remaining signal at that point; in that case algorithm 1 will have only one viable atom to consider at every stage after the first.

The actual worst case scenario, where every atom satisfies the minimum bounds at each step of the algorithm, can be ruled out if the measured signal is guaranteed to be composed of atoms of the dictionary and the dictionary has certain properties. We start with the simplest case that allows us to rule out worst case performance.

Theorem 3.1. If the dictionary D contains an orthogonal basis of dimension m⊥ and the measured signal y is composed of k dictionary atoms, then at least m⊥ − k − 1 atoms will be ruled out.

⁷ Meaning that no other atom in the dictionary is a linear combination of any atom present in the measured signal.
⁸ ||y_i||_1 denotes the ℓ1 norm residual when the i best-fitting atoms are included in the solution.

Proof. By definition, dictionary atoms that are orthogonal to the measured signal will have a dot product of 0 with the measured signal and so have optimal coefficients of 0, as will any dictionary atoms that are linear combinations of those m⊥ − k − 1 atoms. Thus the initial step of algorithm 1 will rule out all of those dictionary atoms.

When n >> m⊥ this is a very weak bound that provides little in the way of performance guarantees. However, we can use the constraints x ≥ 0 and y ≥ Dx to derive stronger bounds. Taken together, these constraints imply that any dictionary atom that has a non-zero value where the measured signal is 0 will have an optimal coefficient of 0.

Theorem 3.2. If the mean ℓ0 norm of the dictionary atoms is 0 ≤ z ≤ m and the number of 0 values in the measured signal is 0 ≤ w ≤ m, then, assuming the likelihood that each individual index of the dictionary atoms and measured signal is non-zero conforms to a Bernoulli distribution, P(D[i, j] > 0, y[j] = 0) = wz/m².

Proof. For two independent events A and B, P(A, B) = P(A)P(B). If E(||D[i, :]||_0) = z and P(D[i, j] = 0) is independent from P(D[i, k] = 0) when j ≠ k, then P(D[i, j] > 0) = z/m. Similarly, if m − ||y||_0 = w then P(y[j] = 0) = w/m. As we are assuming that the distributions of non-zero values within the dictionary atoms and measured signal are independent, P(D[i, j] > 0, y[j] = 0) = P(D[i, j] > 0)P(y[j] = 0), which completes the proof.

Corollary 3.2.1. E(g(D, y)) = n(wz/m²), where g(D, y) is the number of atoms of D that have a non-zero value where y has a zero value.

Proof. Follows immediately from the assumption that the distributions of non-zero values per dictionary atom and the measured signal are both independent and follow a Bernoulli distribution with mean q = m − z.

Even if wz/m² is small, if n >> m there is a strong likelihood that E(g(D, y)) > m, giving a tighter bound than theorem 3.1. That said, in practical applications, it is unlikely that the probability of each individual index of a dictionary atom being non-zero is identical. Additionally, it may be that the probabilities of each index being non-zero are not independent; a more detailed analysis of probabilities in such a circumstance requires knowledge of the context in which the data was generated.
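As a quick illustration of Corollary 3.2.1 under its Bernoulli assumptions, the following Julia snippet evaluates E(g(D, y)) for some arbitrary example dimensions; these numbers are not taken from the paper's tests.

```julia
# Illustrative evaluation of E(g(D, y)) = n * w * z / m^2; example values only.
m, n = 500, 10_000          # signal dimension and dictionary size
z, w = 100.0, 250.0         # mean ℓ0 norm of the atoms, number of zeros in y
expected_ruled_out = n * (w * z / m^2)
println(expected_ruled_out)  # 1000.0, which exceeds m, a tighter bound than Theorem 3.1
```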

3.2 Bounds For Spectroscopic Applications

In most spectroscopy applications signals are rarely point-like. Such signals are commonly modeled as Cauchy, Gaussian, or some combination of the two [7][14]. In order to determine bounds in these conditions we will use the probability density function (pdf) and cumulative density function (cdf) of the Cauchy and Gaussian distributions to estimate the likelihood that an index is non-zero based on the location of the observed peak maximum and the modeled width of the peak.

For a Cauchy distribution with location x_0 and scale γ, the pdf is

φ_C(x) = 1 / (πγ(1 + ((x − x_0)/γ)²))    (7)

and the cdf is

Φ_C(x) = (1/π) arctan((x − x_0)/γ) + 1/2    (8)

For a Gaussian distribution with mean µ and variance σ² the pdf is

φ_G(x) = (1/(σ√(2π))) e^(−(1/2)((x−µ)/σ)²)    (9)

and the cdf is

Φ_G(x) = (1/2)[1 + erf((x − µ)/(σ√2))]    (10)

The mapping of the pdf and cdf values to indices depends on the meaning of the indices themselves. Typically, if the dictionary atoms and measured signal are spectroscopic data, the indices will be mapped to a wavelength of light, with the resolution of the instrument used to collect the data determining both the dimension of the data and the distance in wavelength between each index [1]. For this analysis we shall assume a uniform index-to-wavelength mapping of W(I) = 5I/6 − 1/6, where I is the numerical index ranging from 1 to m.

Given that neither equation 7 nor 9 ever actually evaluates to 0, but spectral signals do reach a point where they are undetectable by the instrument being used to measure them, we shall truncate the peak at a suitable point. If the full width at half max (FWHM) of a peak is 1 nm, we shall set any values further than 2 nm away from the peak center to be 0 (ignoring noise). We shall further assume that any negative values in a dictionary atom or the measured signal are set to 0, as negative values are unphysical in this context. Assuming the noise present in each dictionary atom follows a Gaussian distribution with mean 0 and variance σ_n², we can determine the probability that a given index in a dictionary atom is 0.

We begin the analysis with a lemma.

Lemma 3.3. For a dictionary atom or measured signal ν with either a single peak or multiple peaks that do not overlap, and a peak center at index x0,

P(ν[j] > 0) = (1/2)[1 + erf(−φ_C(W(j))/(σ√2))]   if |W(j) − W(x_0)| ≤ 2
P(ν[j] > 0) = 1/2                                   if |W(j) − W(x_0)| > 2

Proof. When |W(j) − W(x_0)| ≤ 2 the probability that ν[j] > 0 is equivalent to the probability that the noise is both negative and equal or larger in magnitude to the pure signal at that index: P(N(0, σ²) ≤ −φ_C(W(j))) = Φ_G(−φ_C(W(j))) = (1/2)[1 + erf(−φ_C(W(j))/(σ√2))]. When |W(j) − W(x_0)| > 2 we are truncating the peak to 0, meaning that the probability of the observed value of ν[j] being 0 is equal to the probability that the noise is less than or equal to 0: P(N(0, σ²) ≤ 0) = Φ_G(0) = (1/2)[1 + erf(0/(σ√2))] = 1/2.

Corollary 3.3.1. Over a single, non-overlapping peak region with a = W⁻¹(W(x_0) − 1) and b = W⁻¹(W(x_0) + 1), P(ν[a : b] > 0) = Π_{j=a}^{b} (1/2)[1 + erf(−φ_C(W(j))/(σ√2))].

Proof. Follows immediately from P(ν[j] > 0) = (1/2)[1 + erf(−φ_C(W(j))/(σ√2))] when |W(j) − W(x_0)| ≤ 2.
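For readers who want to evaluate these quantities directly, the following Julia sketch computes the per-index probability from Lemma 3.3, using the index-to-wavelength mapping W(I) = 5I/6 − 1/6 from section 3.2. It assumes the SpecialFunctions.jl package for erf; the peak scale γ and noise level σ are arbitrary example values.

```julia
# Hedged sketch of the Lemma 3.3 probability; γ and σ below are example values.
using SpecialFunctions   # provides erf

W(I) = 5I / 6 - 1 / 6                                   # index-to-wavelength mapping
ϕC(x, x0, γ) = 1 / (π * γ * (1 + ((x - x0) / γ)^2))     # Cauchy pdf, equation 7

function p_nonzero(j, x0_idx; γ = 0.5, σ = 0.05)
    if abs(W(j) - W(x0_idx)) <= 2                        # inside the truncated peak region
        return 0.5 * (1 + erf(-ϕC(W(j), W(x0_idx), γ) / (σ * sqrt(2))))
    else                                                 # outside the peak: noise only
        return 0.5
    end
end
```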

Theorem 3.4. Given two independent dictionary atoms or measured signals ν1 and ν2 each with a single peak or multiple peaks that do not overlap,

P(ν_1[j] = 0, ν_2[j] > 0) =
(1 − (1/2)[1 + erf(−φ_C(W(j))/(σ√2))])((1/2)[1 + erf(−φ_C(W(j))/(σ√2))])   if |W(j) − W(x_{0ν1})| ≤ 2 and |W(j) − W(x_{0ν2})| ≤ 2
1/2 − (1/4)[1 + erf(−φ_C(W(j))/(σ√2))]                                       if |W(j) − W(x_{0ν1})| ≤ 2 and |W(j) − W(x_{0ν2})| > 2
1/2 − (1/4)[1 + erf(−φ_C(W(j))/(σ√2))]                                       if |W(j) − W(x_{0ν1})| > 2 and |W(j) − W(x_{0ν2})| ≤ 2
1/4                                                                           if |W(j) − W(x_{0ν1})| > 2 and |W(j) − W(x_{0ν2})| > 2

Proof. By lemma 3.3, we know that P(ν_1 = 0) = 1 − (1/2)[1 + erf(−φ_C(W(j))/(σ√2))] when |W(j) − W(x_{0ν1})| ≤ 2, and that P(ν_2 > 0) = (1/2)[1 + erf(−φ_C(W(j))/(σ√2))] when |W(j) − W(x_{0ν2})| ≤ 2. As ν_1 and ν_2 are independent,

P(ν_1 = 0, ν_2 > 0) = P(ν_1 = 0)P(ν_2 > 0) = (1 − (1/2)[1 + erf(−φ_C(W(j))/(σ√2))])((1/2)[1 + erf(−φ_C(W(j))/(σ√2))])

when |W(j) − W(x_{0ν1})| ≤ 2 and |W(j) − W(x_{0ν2})| ≤ 2, which proves the first case.

Reusing lemma 3.3, when |W(j) − W(x_{0ν1})| > 2, P(ν_1 > 0) = 1/2. Therefore,

P(ν_1 = 0, ν_2 > 0) = P(ν_1 = 0)P(ν_2 > 0) = (1/2)(1 − (1/2)[1 + erf(−φ_C(W(j))/(σ√2))]) = 1/2 − (1/4)[1 + erf(−φ_C(W(j))/(σ√2))]

which proves cases two and three. The final case follows simply from the lemma as well.

3.3 Possible Pitfalls

In spectroscopic contexts, peak-shifting and distortion is a source of difficulties in analysis. Peak shifts can be caused by a variety of physical and chemical phenomena [10][18][11], and can be accounted for using several types of methods. When time is not a factor and the nature of the sample is known, manual analysis of the collected spectra can deal with these issues. However, when the nature of a sample is unknown and the dictionary being used is large, it is entirely possible that there will be multiple dictionary atoms with similar or identically shaped peaks separated by very few wavenumbers. In this case, such peak shifts may cause misidentification of sample components, particularly if the components have only one peak, and algorithm 1 may break down.

Peak shifts of this type will cause similar problems in most algorithms typically used for sparse approximation applications, so this is a major issue that will have to be addressed in a preprocessing step for any such algorithm to be successful. Chatzidakis and Botton [3] propose a deep learning based framework to promote calibration-free⁹ analysis of spectra, but to the best of our knowledge this has not been tested in an industrial context.

Another major driver of computational expense for algorithm 1 is the proportion of the measured signal made up by each of its constituent atoms. Recall equation 2, which states that the optimal subset S of D must account for at least |S|/k of the ℓ1 norm of the measured signal y. If all atoms comprising y are in equal proportion then the bound in equation 2 is exact and there will be a minimum of k elements remaining at each stage of the algorithm.

Furthermore, when there is a large disparity between the amounts of the measured signal comprised by the different dictionary atoms, the choice of stopping criterion for the algorithm becomes more important. If the stopping criterion is too permissive it is possible that a dictionary atom comprising a small amount of the measured signal will be missed, whereas if the criterion is too strict the algorithm will spend additional time attempting to fit atoms to what is likely to be noise (or signal too small to be distinguished from noise). This makes it crucial for an accurate estimate of the signal to noise ratio to be determined if the identification of small magnitude components of the measured signal is important to the application.

4 Computational Tests

A dictionary with 1000 atoms of dimension 500, each generated as combinations of Gaussian peaks of varying mean and standard deviation, was randomly generated (see figure 1). The atoms were then scaled to have an ℓ2-norm of 1, and tested to ensure that no atoms were exact linear scalings of any other atom. The coherence parameter µ is 0.999817401463998 and the Babel function¹⁰ value for m = 2 is 1.99442536272711, meaning that for signals composed of greater than one dictionary atom the conditions specified in [16] are not satisfied, and convex relaxations of the sparseness condition are not guaranteed to find the optimal solution.

The dictionary atoms were generated this way in order to mirror the sort of spectrum library present when performing NNSA with a device utilizing Raman or mass spectroscopy, where peaks are roughly Gaussian in shape but have varying widths and relative magnitudes.
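A minimal sketch of how such a synthetic dictionary could be generated in Julia is given below. The peak counts, centers, and widths are assumptions chosen only to illustrate the idea; the paper's exact generation procedure and random seed are not reproduced here.

```julia
# Hedged sketch of a synthetic Gaussian-peak dictionary; parameters are illustrative.
using LinearAlgebra, Random

function make_dictionary(m, n; rng = MersenneTwister(0))
    grid = collect(1:m)
    D = zeros(m, n)
    for j in 1:n
        for _ in 1:rand(rng, 1:4)                        # a handful of peaks per atom
            μ = rand(rng) * m                            # random peak center
            σ = 2 + 10 * rand(rng)                       # random peak width
            D[:, j] .+= exp.(-0.5 .* ((grid .- μ) ./ σ) .^ 2)
        end
        D[:, j] ./= norm(D[:, j], 2)                     # scale each atom to unit ℓ2 norm
    end
    return D
end

D = make_dictionary(500, 1000)    # the dimensions used in these tests
```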

Testing was performed on a Dell Precision 7520 laptop with a 3.10GHz Intel Xeon E3-1535M processor and 64GB of RAM.

The pure signal used for this test was generated by randomly choosing 5 atoms of D, shown in figure 2a, and randomly generating mixing coefficients between 0.01 and 100. The mixing coefficients used to generate the pure signal were [35.0079, 28.7156, 92.7495, 5.1409, 59.2708] for dictionary atoms [745, 893, 243, 130, 226], respectively.

⁹ Without dependence on the index-to-wavelength mapping discussed in the previous section.
¹⁰ See [16] for the definition of this function.


Figure 1: Sampled entries of dictionary D


(a) Dictionary atoms that make up the pure signal (b) Dictionary atoms scaled with mixing coefficients

Figure 2: Unscaled and scaled pure signal components

After multiplying by the mixing coefficients, the selected dictionary atoms are shown in figure 2b. The smallest contribution is clearly made by dictionary atom 130, which accounts for less than 2.5% of the total signal present. Correspondingly, any appreciable amount of noise will make this signal very difficult to detect.


(a) Pure signal (b) Measured signal

Figure 3: Pure and measured signal

The "measured" signal was generated by calculating the mean of the pure signal (figure 3a) and adding Gaussian noise with a mean of 0 and a standard deviation equal to the mean of the pure signal divided by 10, shown in figure 3b. Visually, while this level of noise does not appear to impact the primary features present in the signal, it does mean that some of the subtler features are noticeably obscured.
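A minimal Julia sketch of this construction is shown below, reusing the dictionary sketch above; the atom indices and mixing coefficients are the ones reported earlier in this section, while the dictionary D itself is assumed from the earlier illustrative code.

```julia
# Hedged sketch of the pure/measured signal construction; assumes D from the sketch above.
using Statistics

idx  = [745, 893, 243, 130, 226]
coef = [35.0079, 28.7156, 92.7495, 5.1409, 59.2708]
pure = D[:, idx] * coef                            # pure signal as a mixture of 5 atoms
σ_noise = mean(pure) / 10                          # noise std = one tenth of the signal mean
measured = pure .+ σ_noise .* randn(length(pure))  # "measured" signal with Gaussian noise
```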

As can be seen in figures 4 and 5, both the LASSO reconstruction and the BnB reconstruction are smoother than the "measured" signal, appearing closer to the form of the pure signal. However, it is clear from figure 5a that the LASSO reconstruction contains deviations from the pure signal that appear to be the result of attempting to account for the noise present in the measured signal. Conversely, the BnB reconstruction does not contain similar deviations.

Method | Parameters | Median Runtime (1000 runs) | Unexplained Signal
LASSO (IRLS) | tol = 0.001 | 1.33s | 0.027308%
LASSO (IRLS) | tol = 0.00001 | 6.154s | 0.0271%
BnB | t = [1/5, 2/5, 3/5, 4/5], ε = −0.1, ρ = −2.5 | 35.927s | 0.0092%
BnB | t = [3/10, 5.5/10, 4/5, 9/10], ε = −0.1, ρ = −2.5 | 4.087s | 0.0092%

Table 1: Results of computational tests


Figure 4: Comparison of pure signal, noisy signal, LASSO reconstruction, and BnB reconstruction

The root of this behavior can be seen in figures 6a and 6b, which compare the reconstruction coefficients of the two methods. The LASSO method has no explicit constraint on the number of non-zero coefficients, instead relying on the underlying mathematical properties of the data to guarantee that the convex relaxation of the ℓ0-norm constraint will result in a minimally sparse set of coefficients. As this test instance does not have the necessary mathematical properties for this guarantee to hold, the LASSO reconstruction contains several extraneous coefficients that are a result of the influence of the noise present in the "measured" signal.


(a) Low signal area of pure signal, noisy signal, LASSO reconstruction, and BnB reconstruction (b) High signal area of pure signal, noisy signal, LASSO reconstruction, and BnB reconstruction

Figure 5: Comparison of measured, pure, and reconstructed signals

However, because the BnB reconstruction has explicit constraints on the number of non-zero coefficients, there are no extraneous non-zero coefficients present in the reconstruction, and therefore a much smaller amount of deviation from the pure signal, despite the presence of noise.


(a) Comparison of BnB and LASSO coefficients (b) Discrepancy between BnB and LASSO coefficients

Figure 6: Comparison of predicted coefficients

Furthermore, the LASSO method incorrectly assigns roughly half of atom 130's contribution to atom 151, demonstrating a potential pitfall of convex relaxations when the optimality conditions are not met: components with small relative magnitudes may be improperly accounted for.

5 Conclusions and Future Research

5.1 Conclusions

Given the optimality guarantees described in section 2, the performance bounds given in section 3, and the results of computational tests described in section 4, it is clear that the algorithm presented in this article is a strong tool for solving NNSA problems when the solution is required to be very sparse. This is particularly true in contexts where robustness to noise and the ability to identify smaller components of the measured signal are paramount.

As the guarantee of optimality rests on simpler assumptions than convex relaxation based methods - namely that all components of the measured signal be present in the dictionary and that the dictionary atoms and measured signal are non-negative - algorithm 1 is broadly applicable, even when the dictionary being used has otherwise undesirable properties. Furthermore, fine-grained performance analysis can be performed based on basic statistical properties of the dictionary being utilized, in contrast to convex relaxation, which relies on properties of the matrix that are computationally intensive to determine for large dictionaries.

In a practical sense, this algorithm is relatively simple to implement and relies on only one black-box module (a LAPACK implementation) for functionality. An implementation in Julia is available for testing and evaluation here. A C++ implementation, which is noticeably faster, using the Eigen framework was created but is not available publicly at this time.

While the non-negativity requirements may seem restrictive, analysis of physical quantities almost always results in non-negative data, providing numerous fields where the NNSA problem is potentially important to solve. Natural Language Processing (NLP) is one possible field for which the sparse approximation problem is relevant and the non-negativity constraints are generally satisfied. Algorithm 1 may be a powerful tool there, but we do not explore this possibility in detail here.

5.2 Future Research

It is currently open whether or not the complexity classes FPT and XP_uniform are equal. It is generally considered unlikely that they are equal, due to the consequence that if FPT = XP_uniform then EPTAS = PTAS [2]. In section 2 we established that algorithm 1 places the NNSA problem in XP_uniform, meaning that a proof that the NNSA problem is not in FPT would immediately imply a separation between the two classes (and furthermore between EPTAS and PTAS).

A related question concerns the relationship between NNSA and the general SA problem. Both problems are NP-Hard, but to the best of our knowledge it is open whether SA is reducible to NNSA in general. It seems that the addition of non-negativity constraints makes NNSA an easier problem than the general SA problem, but as of now this is only a conjecture. If SA can be reduced to NNSA then clearly SA is in XP_uniform, but it is unclear what, if any, additional consequences would arise from this.

An obvious next question would be whether or not algorithm 1 can be extended to solve the general SA problem. Unfortunately, the analysis used to determine the bounds on how much information a dictionary atom must explain in order to be a possible candidate for the optimal subset relies heavily on the non-negativity constraint, so any extension of the algorithm would likely require an entirely new procedure for determining said bounds. Even the possibly simpler question of whether the branch and bound scheme can be adapted to handle the situation where non-negativity is required for the measured signal and dictionary atoms but not the coefficients would require new methodology.

The final potential follow-up question addressed here is whether or not algorithm 1 is optimal for the NNSA problem. A direct proof of this would imply P ≠ NP, so such a proof is unlikely to be attainable in the near term. However, what interesting consequences can be derived from the assumption that the NNSA problem lies in smaller complexity classes than XP_uniform? For instance, if NNSA lies in a given level of the W hierarchy, are separations or inclusions that are currently considered unlikely provable as consequences? Answers would provide evidence for whether or not algorithm 1 can be substantially improved upon for optimally solving the NNSA problem.

References

[1] David A. Carter and Jeanne E. Pemberton. Frequency/Wavelength Calibration of Multipurpose Multichannel Raman Spectrometers. Part I: Instrumental Factors Affecting Precision. In: Appl. Spectrosc. 49.11 (Nov. 1995), pp. 1550–1560. url: http://as.osa.org/abstract.cfm?URI=as-49-11-1550.

[2] M. Cesati and L. Trevisan. On the efficiency of polynomial time approximation schemes. In: Electronic Colloquium on Computational Complexity (1997). url: https://eccc.weizmann.ac.il//eccc-reports/1997/TR97-001/.

[3] M. Chatzidakis and G. A. Botton. Towards calibration-invariant spectroscopy using deep learning. In: Scientific Reports 9 (2019). doi: 10.1038/s41598-019-38482-1. url: https://rdcu.be/b5nU0.

[4] Yichen Chen, Yinyu Ye, and Mengdi Wang. Approximation Hardness for A Class of Sparse Optimization Problems. In: Journal of Machine Learning Research 20.38 (2019), pp. 1–27. url: http://jmlr.org/papers/v20/17-373.html.

[5] Patrick L. Combettes and Valérie R. Wajs. Signal Recovery by Proximal Forward-Backward Splitting. In: Multiscale Modeling and Simulation 4.4 (2005), pp. 1168–1200. doi: 10.1137/050626090. url: https://epubs.siam.org/doi/10.1137/050626090.

[6] G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. In: Constr. Approx. 13 (1997). doi: 10.1007/BF02678430.

[7] Jack Dodd and Lin Denoyer. Curve-Fitting: Modeling Spectra. Aug. 2006. isbn: 9780470027325. doi: 10.1002/0470027320.s4503.

[8] D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. In: Proceedings of the National Academy of Sciences 100 (Mar. 2003), pp. 2197–2202. doi: 10.1073/pnas.0437847100.

[9] R. G. Downey and M. R. Fellows. Provable Intractability: The Class XP. In: Parameterized Complexity. New York, NY: Springer New York, 1999, pp. 341–350. isbn: 978-1-4612-0515-9. doi: 10.1007/978-1-4612-0515-9_15. url: https://doi.org/10.1007/978-1-4612-0515-9_15.

[10] Bernadine M. Flanagan, Frederick J. Warren, and Michael J. Gidley. Infrared spectroscopy as a tool to characterise starch ordered structure - a joint FTIR-ATR, NMR, XRD and DSC study. In: Carbohydrate Polymers 139 (2016). doi: 10.1016/j.carbpol.2015.11.066.

[11] K. W. R. Gilkes et al. Direct observation of sp3 bonding in tetrahedral amorphous carbon using ultraviolet Raman spectroscopy. In: Applied Physics Letters 70.15 (1997), pp. 1980–1982. doi: 10.1063/1.118798. url: https://doi.org/10.1063/1.118798.

[12] Patrick R. Gill, Albert Wang, and Alyosha Molnar. The In-Crowd Algorithm for Fast Basis Pursuit Denoising. In: IEEE Transactions on Signal Processing 59.10 (2011), pp. 4595–4605. doi: 10.1109/TSP.2011.2161292. url: https://ieeexplore.ieee.org/document/5940245.

[13] Saralees Nadarajah. Making the Cauchy work. In: Braz. J. Probab. Stat. 25.1 (Mar. 2011), pp. 99–120. doi: 10.1214/09-BJPS112. url: https://doi.org/10.1214/09-BJPS112.

[14] J. Pitha and R. Norman Jones. A comparison of optimization methods for fitting curves to infrared band envelopes. In: Canadian Journal of Chemistry 44.24 (1966), pp. 3031–3050. doi: 10.1139/v66-445. url: https://doi.org/10.1139/v66-445.

[15] Robert Tibshirani et al. Sparsity and Smoothness via the Fused Lasso. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 67.1 (2005), pp. 91–108. issn: 1369-7412, 1467-9868. url: http://www.jstor.org/stable/3647602.

[16] Joel A. Tropp. Greed is Good: Algorithmic Results for Sparse Approximation. 2004.

[17] Joel A. Tropp and Anna C. Gilbert. Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit. In: IEEE Transactions on Information Theory 53.12 (2007), pp. 4655–4666. doi: 10.1109/tit.2011.2173241. url: http://users.cms.caltech.edu/~jtropp/papers/TG07-Signal-Recovery.pdf.

[18] Yan Wang, Daniel C. Alsmeyer, and Richard L. McCreery. Raman spectroscopy of carbon materials: structural basis of observed spectra. In: Chemistry of Materials 2.5 (1990), pp. 557–563. doi: 10.1021/cm00011a018. url: https://doi.org/10.1021/cm00011a018.

[19] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005), pp. 301–320. doi: 10.1111/j.1467-9868.2005.00503.x. url: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2005.00503.x.
