
Fast Convolutive Nonnegative Matrix Factorization Through Coordinate and Block Coordinate Updates

Anthony Degleris, Benjamin Antin, Surya Ganguli, Alex H. Williams

arXiv:1907.00139v1 [cs.LG] 29 Jun 2019

This work received support from the Department of Energy Computational Science Graduate Fellowship (CSGF) program, the Burroughs Wellcome Fund, the Alfred P. Sloan Foundation, the Simons Foundation, the McKnight Foundation, the James S. McDonnell Foundation, and the Office of Naval Research. The authors are with the Departments of Electrical Engineering (A.D., B.A.), Applied Physics (S.G.), and Statistics (A.H.W.), Stanford University, Stanford, CA 94305 USA (e-mail: [email protected]).

Abstract—Identifying recurring patterns in high-dimensional time series data is an important problem in many scientific domains. A popular model to achieve this is convolutive nonnegative matrix factorization (CNMF), which extends classic nonnegative matrix factorization (NMF) to extract short-lived temporal motifs from a long time series. Prior work has typically fit this model by multiplicative parameter updates, an approach widely considered to be suboptimal for NMF, especially in large-scale data applications. Here, we describe how to extend two popular and computationally scalable NMF algorithms, Hierarchical Alternating Least Squares (HALS) and Alternating Nonnegative Least Squares (ANLS), to the CNMF model. Both methods demonstrate performance advantages over multiplicative updates on large-scale synthetic and real-world data.

Index Terms—Convolutive nonnegative matrix factorization, hierarchical alternating least squares, alternating nonnegative least squares, coordinate descent

I. INTRODUCTION

NMF models a matrix of nonnegative data, X, as the product of two low rank and nonnegative matrices, thus approximating each datapoint (a row or column of X) as a conical combination of basis features or latent factors [1], [2]. When the low rank assumption is appropriate, NMF often yields highly interpretable descriptions of data and thus is a highly effective tool for exploratory data analysis. NMF has been applied to high-dimensional time series data, with applications ranging from audio and image processing to neuroscience and text mining [2]–[5].

However, many time series contain short-term temporal correlations or sequences of events that are not approximately low rank, and thus cannot be extracted by NMF. For example, audio recordings are typically represented and visualized as spectrograms, which display the frequency content of sound over time as the signal varies. Many sounds of interest have recognizable signatures in the frequency domain, which are not low rank; e.g. phonemes in human speech data may slightly change in frequency (pitch) over their production interval. Similarly, in time series data from neuroscience, it is common to find clusters of brain cells that fire in a rapid sequence [6], [7]. NMF could efficiently model these firing events if neurons fired simultaneously; however, sparse sequences of neural firing yield high rank data matrices. At best, NMF can only describe such sequences through multiple latent factors. At worst, including additional factors to fit such structure may result in overfitting.

Convolutive NMF (CNMF) is a simple extension of the NMF model that overcomes these shortcomings. As its name suggests, CNMF introduces convolutional structure into the low rank model reconstruction, and thus captures short-term temporal dependencies in time series data [8], [9]. The CNMF model has been effective in a variety of applications, including neuroscience [10], medical data mining [11], and audio signal processing [12].

In recent years, algorithms for NMF have matured to a stage where it is computationally tractable to fit very large datasets [13], [14]. However, the CNMF model cannot be viewed as a special case of NMF, and thus these algorithmic improvements are not immediately transferable to CNMF. As a result, while many high-performance and computationally scalable code packages are available for NMF [15], [16], algorithms and implementations of CNMF are less mature.

The Multiplicative Update (MU) algorithm appears to be the most common optimization routine for CNMF in the published literature [8]–[10], [12]. This method was originally developed for NMF [1], [17], and later adapted to CNMF [8], [9]. However, subsequent work found MU to be relatively ineffective for NMF [2], suggesting that MU may also be suboptimal for CNMF. Here we derive two new algorithms for the CNMF model, Hierarchical Alternating Least Squares (HALS) and Alternating Nonnegative Least Squares (ANLS), both of which can be understood as extensions of successful NMF algorithms [2] and are special cases of coordinate and block coordinate descent [18], [19]. We show that HALS and ANLS outperform MU on CNMF models fit to large-scale data. Additionally, we derive several reformulations of the CNMF objective function which lead to new and useful interpretations of the model.

II. BACKGROUND

A. Notation

We denote a vector with P real-valued entries as x ∈ R^P, a P × Q matrix as X ∈ R^{P×Q}, and a P × Q × R tensor (in this paper, a tensor is an array with three indices) as X ∈ R^{P×Q×R}. If a matrix (or vector or tensor) has strictly nonnegative entries, we write X ∈ R_+^{P×Q}, or alternatively X ≥ 0.

We denote the ith slice of the tensor X along its first mode as X_{i::}, which indicates that the first index is fixed to i while the rest remain free.

Fig. 1. Schematic illustration of NMF and CNMF models fitting the same dataset. (a) NMF models a data matrix X (lower right) as the product of W (matrix with K = 5 columns, left) and H (matrix with K = 5 rows, top). In the "sum-of-outer-products" interpretation of the model, each column of W represents a group of simultaneously activated features, while the corresponding row of H represents the times at which this group of features is active. (b) CNMF extends this "sum-of-outer-products" interpretation using a convolution operator instead of a vector outer product. Here, the same data are modeled using a tensor W (tensor with K = 2 slices, left) and H (matrix with K = 2 rows, top). Each of the K slices of W can be thought of as a spatiotemporal feature of temporal duration L, and the times at which each such feature is convolutionally activated are specified by the corresponding row of H. This structure enables a much more compact and interpretable representation of this example time series.

In our example above, X_{i::} refers to a matrix of size Q × R, whereas X_{::i} refers to a matrix of size P × Q. In the following sections, we overload the notation W_ℓ = W_{ℓ::} to refer to the matrix created by fixing the index of the tensor W along the first mode. Concretely, if W is a tensor of size L × N × K, then W_ℓ is a matrix of size N × K.

The symbol ⊙ refers to element-wise multiplication of matrices, i.e. (A ⊙ B)_{ij} = A_{ij} B_{ij} (Hadamard product). Similarly, the notation A/B refers to element-wise division. In all cases, the norm of a vector, matrix, or tensor is defined as the root sum-of-squares. For a tensor X ∈ R^{I×J×K}, this is

\|X\| = \Big( \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} X_{ijk}^2 \Big)^{1/2}.

The symbol ⊗ refers to the Kronecker product between two matrices. If A ∈ R^{m×n} and B ∈ R^{p×k}, then the Kronecker product A ⊗ B ∈ R^{mp×nk} is

A \otimes B = \begin{bmatrix} A_{11}B & \cdots & A_{1n}B \\ \vdots & \ddots & \vdots \\ A_{m1}B & \cdots & A_{mn}B \end{bmatrix}.

If the matrix A has columns a_1, ..., a_n, then the vectorization of A is defined as

\mathrm{vec}(A) = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}.

When discussing update rules that solve the NMF and CNMF problems, we use superscripts to denote the iteration number of the algorithm. For example, W^{(i)} refers to the matrix W at the ith iteration of an algorithm or update scheme, W^{(i+1)} refers to the next iterate, and so on.

B. The NMF and CNMF models

Given a data matrix X ∈ R_+^{N×T}, NMF attempts to find two nonnegative factor matrices W ∈ R_+^{N×K} and H ∈ R_+^{K×T} that roughly approximate X. Formally, the NMF problem is:

\min_{W,H}\ \|X - WH\|^2 \quad \text{subject to}\quad W \ge 0,\ H \ge 0. \qquad (1)

This model can be reformulated as a sum of outer products. Letting w_1, ..., w_K be the columns of W and h_1^T, ..., h_K^T be the rows of H, the objective is:

\min_{w_1,\dots,w_K,\, h_1,\dots,h_K}\ \Big\| X - \sum_{k=1}^{K} w_k h_k^T \Big\|^2 \quad \text{subject to}\quad w_k \ge 0,\ h_k \ge 0\ \ \forall k \in \{1, 2, \dots, K\}. \qquad (2)

The NMF model can be effective when X is nonnegative and approximately low rank. However, NMF may perform poorly as a feature extraction method when X contains short-lived temporal motifs with high-rank structure. This is demonstrated schematically in Figure 1a. The data matrix X represents a time series with T measurements, with each column representing a single measurement of N variables. For example, N could be the number of frequency bins in a spectrogram representation of an audio signal [9], or the number of recorded cells in a neural time series [10]. These time series can contain short-lived patterns that are not low-rank (in Fig. 1, two different recurring patterns are shown in shades of red and green). This results in NMF requiring many dimensions (i.e. a large choice for K) to capture the structure in the data. This hampers interpretability, as visible patterns in the data are split across multiple factors.

The convolutive NMF (CNMF) model was developed to address this shortcoming [8]. CNMF finds a matrix H ∈ R_+^{K×T} and a tensor W ∈ R_+^{L×N×K} that minimize the following objective:

\min_{W,H}\ \Big\| X - \sum_{\ell=1}^{L} W_\ell H S_{\ell-1} \Big\|^2 \quad \text{subject to}\quad W \ge 0,\ H \ge 0, \qquad (3)

where S_ℓ is a T × T column right-shift matrix, defined as a matrix with ones along the ℓth upper diagonal and zeros otherwise. If e_i denotes the ith standard basis vector, then e_i^T S_ℓ = e_{i+ℓ}^T when i + ℓ ≤ T. A visual demonstration makes the role of S_ℓ clear:

A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix}, \quad
AS_1 = \begin{bmatrix} 0 & 1 & 2 & 3 \\ 0 & 5 & 6 & 7 \end{bmatrix}, \quad
AS_2 = \begin{bmatrix} 0 & 0 & 1 & 2 \\ 0 & 0 & 5 & 6 \end{bmatrix}, \ \dots

When L = 1, CNMF reduces exactly to NMF.
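To make the role of the shift matrices concrete, the following short Julia sketch builds the reconstruction Σ_ℓ W_ℓ H S_{ℓ−1} from equation (3) directly. This is not the authors' CMF.jl implementation; the helper names `shift` and `reconstruct` are ours, and the shift matrices are materialized explicitly only for clarity (multiplying by S_{ℓ−1} simply shifts the columns of H to the right by ℓ − 1 positions).

```julia
using LinearAlgebra

# T×T matrix with ones on the s-th upper diagonal (S_s in the text); s = 0 gives the identity
shift(T, s) = diagm(s => ones(T - s))

# CNMF reconstruction of eq. (3): X̂ = Σ_ℓ W_ℓ H S_{ℓ-1},
# with W stored as an L×N×K array and H as a K×T matrix
function reconstruct(W, H)
    L, T = size(W, 1), size(H, 2)
    return sum(W[l, :, :] * H * shift(T, l - 1) for l in 1:L)   # N×T matrix
end

# Example with random nonnegative factors
W, H = rand(3, 10, 2), rand(2, 50)
Xhat = reconstruct(W, H)
```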

Like ordinary NMF, CNMF also has a natural "sum of outer products" form. If we consider the slices W_{::1}, ..., W_{::K} ∈ R_+^{L×N} and the row vectors h_1^T, ..., h_K^T of the matrix H, we can define the convolution operator ∗ by

A = W_{::k}^T * h_k^T, \qquad A_{nt} = \sum_{\ell=1}^{L} W_{\ell n k} H_{k, t-\ell}, \qquad (4)

which can alternatively be written as

W_{::k}^T * h_k^T = \sum_{\tau=1}^{T} H_{k\tau} \begin{bmatrix} 0_{\tau-1} & W_{::k}^T & 0_{T+1-L-\tau} \end{bmatrix}, \qquad (5)

where 0_p signifies p columns of zeros. To make the notation more concise, we abbreviate the zero-padding as follows:

[W_{::k}^T]_\tau = \begin{bmatrix} 0_{\tau-1} & W_{::k}^T & 0_{T+1-L-\tau} \end{bmatrix}, \qquad (6)

W_{::k}^T * h_k^T = \sum_{\tau=1}^{T} H_{k\tau}\, [W_{::k}^T]_\tau. \qquad (7)

Using equation 5, we rewrite the CNMF objective as

\min_{W,H}\ \Big\| X - \sum_{k=1}^{K} W_{::k}^T * h_k^T \Big\|^2 \quad \text{subject to}\quad W \ge 0,\ H \ge 0. \qquad (8)

Here, each W_{::k}^T ∈ R^{N×L} represents a short-lived temporal pattern, or motif, that may have full rank. The nonzero entries of each h_k ∈ R^T represent the times at which this motif occurs. For the idealized time series in Figure 1, CNMF pulls out a simpler and more interpretable description of the data than NMF. In essence, CNMF extracts 2 recurring patterns, corresponding to K = 2 factors in the model. In contrast, NMF requires K = 5 model factors.

We note here briefly that different boundary conditions could be specified for the convolution operation in eq. 8. We adopted zero-padding for these boundary conditions as it appears to be the most standard choice in prior literature [9], [10]. Only minor modifications to our exposition would be needed to handle different choices. For example, H could be re-specified as a K × (T − L) matrix and each S_ℓ could be specified as a (T − L) × T matrix to specify convolution without padding.

C. Multiplicative Update (MU) Algorithms

The objective of the NMF problem (equation 1) is non-convex, and finding an exact solution is NP-hard in general [20]. This has led to extensive algorithmic research on NMF, producing several effective heuristic algorithms [2] and conditions guaranteeing an exact solution in polynomial time [21], [22].

One such heuristic algorithm for NMF is the Multiplicative Update (MU) algorithm. The MU algorithm repeatedly updates W and H according to the following update rule [17]:

W^{(i+1)} = W^{(i)} \odot \frac{X H^{(i)T}}{W^{(i)} H^{(i)} H^{(i)T}}, \qquad (9)

where the index i refers to the current iteration of the algorithm. By the symmetry of the NMF problem (the objective can be expressed as ‖X^T − H^T W^T‖), the same update rule can be applied to H. In reality, the MU algorithm is just gradient descent with per-parameter scaling factors [18]. Its popularity stems from several desirable properties: the update rule is monotonic, simple to implement, and preserves nonnegativity.

Since NMF is a special case of CNMF (with L = 1), solving the latter problem exactly is also NP-hard. Accordingly, heuristic algorithms are also used to fit CNMF, most notably a generalization of MU [9]. In this case, the update rules are

W_\ell^{(i+1)} = W_\ell^{(i)} \odot \frac{X (H^{(i)} S_{\ell-1})^T}{\widehat{X}^{(i)} (H^{(i)} S_{\ell-1})^T}, \qquad (10)

H^{(i+1)} = H^{(i)} \odot \frac{\sum_{\ell=1}^{L} W_\ell^{(i)T} X S_{1-\ell}}{\sum_{\ell=1}^{L} W_\ell^{(i)T} \widehat{X}^{(i)} S_{1-\ell}}, \qquad (11)

where \widehat{X}^{(i)} = \sum_{\ell=1}^{L} W_\ell^{(i)} H^{(i)} S_{\ell-1} is our reconstruction of X. As in the NMF case, MU is easy to implement and has been applied frequently to fit CNMF. Nevertheless, past work has shown MU to be suboptimal for fitting NMF compared to other coordinate descent algorithms [2], [18]. We reasoned that exploiting similar coordinate and block-coordinate updates would lead to performance benefits in the case of CNMF.
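For reference, the CNMF multiplicative updates (10)-(11) can be sketched in Julia as follows. This is a naive dense implementation written for readability rather than speed, and it is not the authors' code; the small constant `eps` guards against division by zero, and whether the reconstruction is refreshed after every slice update is an implementation choice on which MU variants differ.

```julia
using LinearAlgebra

shift(T, s) = diagm(s => ones(T - s))
reconstruct(W, H) = sum(W[l, :, :] * H * shift(size(H, 2), l - 1) for l in 1:size(W, 1))

# One round of multiplicative updates for CNMF (eqs. 10 and 11)
function mu_step!(W, H, X; eps=1e-12)
    L, T = size(W, 1), size(H, 2)
    for l in 1:L                                  # eq. (10), one slice W_ℓ at a time
        Xhat = reconstruct(W, H)
        HS = H * shift(T, l - 1)
        W[l, :, :] .*= (X * HS') ./ (Xhat * HS' .+ eps)
    end
    Xhat = reconstruct(W, H)                      # eq. (11)
    num = sum(W[l, :, :]' * X * shift(T, l - 1)' for l in 1:L)
    den = sum(W[l, :, :]' * Xhat * shift(T, l - 1)' for l in 1:L)
    H .*= num ./ (den .+ eps)
    return W, H
end
```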
D. Hierarchical Alternating Least Squares (HALS) for NMF

Hierarchical alternating least squares (HALS) is a coordinate descent method used to fit NMF [2], [23]. Each update step solves a constrained optimization problem exactly for a single column of W or row of H. To update a single column w_p, we reformulate the NMF objective as

J(W, H) = \Big\| X - \sum_{k=1}^{K} w_k h_k^T \Big\|^2 = \Big\| \Big( X - \sum_{k \ne p} w_k h_k^T \Big) - w_p h_p^T \Big\|^2 \qquad (12)

and fix all variables except for the pth column of W.¹ Minimizing over w_p is a convex problem, and the Karush-Kuhn-Tucker (KKT) conditions for optimality generate the closed-form update rule [23]:

w_p^{(i+1)} = \max\left(0,\ \frac{\big[X - \sum_{k \ne p} w_k^{(i)} h_k^{(i)T}\big]\, h_p^{(i)}}{\|h_p^{(i)}\|^2}\right). \qquad (13)

Numerical experiments suggest this update rule notably outperforms MU [2]. One possible explanation for this is that although both algorithms have a similar flop count to update all of W, HALS solves many exact problems whereas MU computes a single, inexact gradient step.

¹ As in the case of MU, the symmetry of the NMF problem allows us to use the same rule for H. For a more detailed derivation of the HALS updates for NMF, see [23].
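As a point of comparison with (9), a single HALS sweep over the columns of W for plain NMF can be sketched as below. This is a schematic implementation under our own naming, not the authors' code; by the symmetry noted in the footnote, applying the same loop to Xᵀ ≈ Hᵀ Wᵀ updates the rows of H.

```julia
using LinearAlgebra

# One HALS pass over the columns of W (eq. 13), with X ≈ W*H, W ∈ R^{N×K}, H ∈ R^{K×T}
function hals_update_W!(W, H, X)
    R = X - W * H                            # full residual
    for p in 1:size(W, 2)
        hp = H[p, :]
        R .+= W[:, p] * hp'                  # add back the p-th rank-one term
        W[:, p] .= max.(0, (R * hp) ./ (norm(hp)^2 + eps()))
        R .-= W[:, p] * hp'                  # subtract the updated term
    end
    return W
end
```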

E. Alternating Nonnegative Least Squares (ANLS) for NMF

Another popular approach to the NMF problem is to fix W or H, and to solve the resulting convex sub-problem exactly. This leads to an algorithm known as Alternating Nonnegative Least Squares (ANLS), whose updates are:

W^{(i+1)} = \arg\min_{W \ge 0}\ \|X - W H^{(i)}\|^2, \qquad (14)

H^{(i+1)} = \arg\min_{H \ge 0}\ \|X - W^{(i+1)} H\|^2. \qquad (15)

Each of these updates amounts to solving a nonnegative least squares problem, which has been extensively studied in the optimization literature [24], [25]. Thus, one can leverage existing nonnegative least squares solvers to compute the solution to each sub-problem. A variety of such solvers are available, including active-set methods, quasi-Newton methods, and projected gradient methods [2]. This motivates us to also extend the ANLS approach to fit CNMF.

III. NEW ALGORITHMS FOR CNMF

A. HALS

In this section, we demonstrate how to extend HALS to fit the CNMF model, highlighting the key reformulations used in deriving the update rule.

Updating W: Recall the CNMF objective from (3). The sum \sum_\ell W_\ell H S_{\ell-1} can be written as a block matrix product by defining

\widetilde{W} = \begin{bmatrix} W_1 & W_2 & \dots & W_L \end{bmatrix}, \qquad
\widetilde{H} = \begin{bmatrix} H S_0 \\ H S_1 \\ \vdots \\ H S_{L-1} \end{bmatrix}.

Using the fact that \sum_\ell W_\ell H S_{\ell-1} = \widetilde{W}\widetilde{H}, we can reformulate the CNMF objective as

\min_{\widetilde{W}, \widetilde{H}}\ \|X - \widetilde{W}\widetilde{H}\|^2 \quad \text{subject to}\quad \widetilde{W} \ge 0,\ \widetilde{H} \ge 0,\ \ \widetilde{H}_{\ell K:(\ell+1)K,\,:} = \widetilde{H}_{0:K,\,:}\, S_\ell \ \ \text{for all } \ell \in \{0, \dots, L-1\}, \qquad (16)

where the last constraint ensures that \widetilde{H} has the block matrix structure described above. This reveals an important fact: the CNMF approximation is an NMF factorization with linear constraints on H.

Due to this reformulation, it is clear that the HALS update rule for NMF extends easily to W. When updating W, we treat H as fixed, and thus we can ignore the linear constraints in 16. Letting \widetilde{w}_p ∈ R^N be the pth column of \widetilde{W} and \widetilde{h}_p ∈ R^T be the pth row of \widetilde{H}, we have the update rule

\widetilde{w}_p^{(i+1)} := \max\left(0,\ \frac{\big[X - \sum_{j \ne p} \widetilde{w}_j^{(i)} \widetilde{h}_j^{(i)T}\big]\, \widetilde{h}_p^{(i)}}{\|\widetilde{h}_p^{(i)}\|^2}\right). \qquad (17)

This rule allows us to update w_{ℓ,:,k} using p = (ℓ − 1)K + k. Note that in practice, the matrices \widetilde{H}, \widetilde{W} do not need to be explicitly instantiated. Each \widetilde{w}_p is simply an array view into W, and each block matrix H S_ℓ comprising \widetilde{H} can be computed on demand.
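The block reformulation in (16) is easy to verify numerically. The following Julia sketch (using our own helper names, and materializing W̃ and H̃ explicitly even though the text notes this is unnecessary in practice) checks that stacking the slices and shifted copies reproduces the convolutive reconstruction; eq. (17) is then just the ordinary HALS column update of the previous section applied to W̃.

```julia
using LinearAlgebra

shift(T, s) = diagm(s => ones(T - s))

L, N, K, T = 3, 6, 2, 40
W, H = rand(L, N, K), rand(K, T)

Wtil = hcat((W[l, :, :] for l in 1:L)...)              # W̃ = [W_1 W_2 ... W_L],        N × LK
Htil = vcat((H * shift(T, l - 1) for l in 1:L)...)     # H̃ = [H S_0; ...; H S_{L-1}],  LK × T

Xhat_conv  = sum(W[l, :, :] * H * shift(T, l - 1) for l in 1:L)
Xhat_block = Wtil * Htil
@assert Xhat_conv ≈ Xhat_block                          # Σ_ℓ W_ℓ H S_{ℓ-1} = W̃ H̃
```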
Updating H: Deriving an update rule for H is more complicated due to the convolutive structure imposed on H. We first consider the outer product form of the CNMF objective from equation 8 and expand the convolution operator:

\Big\| X - \sum_{k=1}^{K} W_{::k}^T * h_k^T \Big\|^2 = \Big\| X - \sum_{k=1}^{K} \sum_{\tau=1}^{T} H_{k\tau}\, [W_{::k}^T]_\tau \Big\|^2, \qquad (18)

where [W_{::k}^T]_t is the matrix W_{::k}^T padded with t − 1 columns of zeros on the left and T + 1 − L − t columns of zeros on the right (first defined in (6)). When every variable except the single entry H_{kt} is held fixed, only columns t through t + L − 1 of the residual depend on H_{kt}, so we define E^{(i)} as the residual with the (k, t) term removed and restricted to those columns,

E^{(i)} = \Big( X - \sum_{(p,\tau) \ne (k,t)} H_{p\tau}^{(i)}\, \big[W_{::p}^{(i)T}\big]_\tau \Big)_{:,\ t:t+L-1}, \qquad (19)

so that, up to terms that do not depend on H_{kt}, the objective becomes \|E^{(i)} - H_{kt} W_{::k}^{(i)T}\|^2. This equation is reminiscent of equation 12, and indeed leads to a related update rule. Fixing all variables but a single entry H_{kt}, we can derive the Lagrangian and corresponding Karush-Kuhn-Tucker (KKT) conditions for optimality (see Appendix B-B). This leads us to a closed form update rule for a single entry of H:

H_{kt}^{(i+1)} = \max\left(0,\ \frac{\mathrm{Tr}\big(W_{::k}^{(i)} E^{(i)}\big)}{\|W_{::k}^{(i)}\|^2}\right), \qquad (20)

which completes the generalization of HALS to CNMF. Indeed, when L = 1, the HALS update rule for NMF (eq. 13) can be recovered exactly. Specifically, when L = 1, both E^{(i)} and W_{::k} reduce to length-N vectors: E^{(i)} is column t of the residual matrix, and W_{::k} is the kth low rank factor. Thus, the numerator term Tr(W_{::k}^{(i)} E^{(i)}) reduces to a vector inner product. To update an entire row of H at once, as is standard for HALS in NMF, the numerator term may be extended to be a matrix-vector product, recovering eq. 13.
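A direct transcription of the single-entry update (20) is sketched below for an interior time index (boundary entries need the motif truncated to the columns that remain inside the matrix). For clarity the residual is recomputed from scratch, which is wasteful; the coordinate-descent variant discussed next instead maintains the residual and patches only the L affected columns.

```julia
using LinearAlgebra

shift(T, s) = diagm(s => ones(T - s))
reconstruct(W, H) = sum(W[l, :, :] * H * shift(size(H, 2), l - 1) for l in 1:size(W, 1))

# HALS update of the single entry H[k, t] (eq. 20), assuming t ≤ T - L + 1
function hals_update_Hkt!(W, H, X, k, t)
    L = size(W, 1)
    cols = t:(t + L - 1)                      # the only columns H[k, t] influences
    Wk = W[:, :, k]                           # motif slice W_{::k}, L × N
    E = (X - reconstruct(W, H))[:, cols]      # residual on those columns
    E .+= H[k, t] .* Wk'                      # add back the (k, t) term → E⁽ⁱ⁾ in (19)
    H[k, t] = max(0, tr(Wk * E) / (norm(Wk)^2 + eps()))
    return H
end
```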

However, when L > 1, updating the full row of H in closed form is not feasible. Specifically, the update rule for H_{kt} is dependent on the current value of H_{kτ} for all t < τ < t + L, meaning that one can only simultaneously update every Lth entry in the kth row of H. Thus, there are two potential extensions of HALS for CNMF when updating H:
• Update H in blocks of size T/L. Iterate over ℓ = 1, ..., L − 1 and, starting at position ℓ, update every Lth entry of row k in H. In principle, this could be achieved by appropriately truncating and reshaping the residual matrix E^{(i)}.
• Update single entries of H. Iterate over t = 1, ..., T and update H_{kt} by equation 20. Note that one need not compute the full residual matrix; only columns ranging from t to t + L of E^{(i)} should be computed. This results in O(NL) total floating point operations to update a single entry of H.
In both cases, the relevant entries in E^{(i)} should be updated after each parameter update. The second option listed above (pure coordinate descent) is simpler to implement, and thus we focused on this variant in our numerical experiments [18].

As in HALS for NMF, adding ℓ1 regularization with weight α amounts to subtracting α from the numerator, and adding ℓ2 regularization with weight β amounts to adding β to the denominator [23]. These extensions to the above algorithm could be used to identify regularized and sparse CNMF models. As we are primarily interested in computational performance, we did not explore the statistical benefits of such regularization methods in detail.

B. ANLS

In this section, we derive an Alternating Nonnegative Least Squares update rule for CNMF. We make use of two different formulations of the CNMF model, one for the update of W and one for the update of H.

Updating W: If we fix all entries of the matrix H, updating the tensor W using ANLS is straightforward. We recall the formulation from (16), in which we expressed the CNMF model as a product of block matrices: \widehat{X} = \widetilde{W}\widetilde{H}. In this form, it is clear that we can update the matrix \widetilde{W} using an off-the-shelf Nonnegative Least Squares solver. Since the matrix \widetilde{W} is simply a reshaped version of the tensor W, this suffices for updating W. Concretely, we have the following update rule:

\widetilde{W}^{(i+1)} = \arg\min_{\widetilde{W} \ge 0}\ \|X - \widetilde{W}\widetilde{H}^{(i)}\|^2. \qquad (21)

Updating H: The update of H requires us to use a different formulation of the CNMF model. First, we recall the following fact (see, e.g., [26]):

\mathrm{vec}(ABC) = (C^T \otimes A)\, \mathrm{vec}(B), \qquad (22)

for any matrices A, B, C (assuming appropriate dimensions). This leads us to the following vectorized version of the CNMF model:

\mathrm{vec}(\widehat{X}) = \underbrace{\sum_{\ell=1}^{L} (S_{\ell-1}^T \otimes W_\ell)}_{V}\ \mathrm{vec}(H).

When W is fixed, the above equation allows us to update H by solving a single Nonnegative Least Squares problem, as we did in the case of W. With this definition, the ANLS update rule for H is given by

H^{(i+1)} = \arg\min_{H \ge 0}\ \|\mathrm{vec}(X) - V\, \mathrm{vec}(H)\|^2. \qquad (23)

Thus, we have cast the CNMF optimization problem as an Alternating Nonnegative Least Squares problem. In practice, the matrix V ∈ R_+^{NT×KT} may be too large to fit in memory. One way around this is to use a matrix-free method, which requires access to the matrix V only through its matrix-vector product. For example, Projected Gradient Descent and Fast Iterative Shrinkage Thresholding (FISTA) are good candidate methods if efficient implementations of Vz and V^T z are available [24], [25]. In practice, we find that directly solving (23) at each iteration is inefficient. However, the formulation above leads to two insights.

Noting that V is a block-Toeplitz matrix, it becomes clear that the update for H is actually a higher-dimensional analogue to the standard nonnegative deconvolution problem studied in the literature [27]. The difference is that here, the coefficient matrix is block-Toeplitz rather than Toeplitz. This suggests the possibility of leveraging the convolutional structure of the problem using approaches which have been applied in the deconvolution case. We leave this to future work.

The second insight is that updating a single column of H with the other columns held fixed is simply a Nonnegative Least Squares problem in K variables which does not require explicitly storing the matrix V. Therefore, one approach to solving (23) is block coordinate descent, updating a single column at a time. Since block coordinate descent converges to the optimal solution for Nonnegative Least Squares problems [28], this approach would eventually reach the optimal solution for (23). In practice, it is not necessary to exactly solve (23) at each iteration. For the purposes of our numerical experiments, we make a single pass of coordinate descent at each iteration (updating each column exactly once), using the block-principal pivoting method described in [29].
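The matrix-free viewpoint mentioned above is straightforward to realize: multiplication by V and Vᵀ never requires forming V, because by (22) each Kronecker term acts as a matrix product with a shifted copy of its argument. The sketch below (our own helper names; a small explicit Kronecker construction is included only to check the operators) provides exactly the two primitives a projected-gradient or FISTA solver for (23) would need.

```julia
using LinearAlgebra

shift(T, s) = diagm(s => ones(T - s))

# V z  = vec(Σ_ℓ W_ℓ Z S_{ℓ-1})     with Z = reshape(z, K, T)
V_mul(W, z, K, T)  = vec(sum(W[l, :, :]  * reshape(z, K, T) * shift(T, l - 1)  for l in 1:size(W, 1)))
# Vᵀ y = vec(Σ_ℓ W_ℓᵀ Y S_{ℓ-1}ᵀ)   with Y = reshape(y, N, T)
Vt_mul(W, y, N, T) = vec(sum(W[l, :, :]' * reshape(y, N, T) * shift(T, l - 1)' for l in 1:size(W, 1)))

# Consistency check against the explicit (and memory-hungry) Kronecker form of V
L, N, K, T = 3, 4, 2, 8
W, H = rand(L, N, K), rand(K, T)
V = sum(kron(shift(T, l - 1)', W[l, :, :]) for l in 1:L)     # NT × KT
y = rand(N * T)
@assert V * vec(H) ≈ V_mul(W, vec(H), K, T)
@assert V' * y ≈ Vt_mul(W, y, N, T)
```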
IV. NUMERICAL EXPERIMENTS

In this section, we compare all three algorithms on synthetic and experimental data. We find that HALS and ANLS both converge significantly faster than MU, and that their performance advantage relative to MU increases with dataset size. For example, on a large audio dataset, we find that HALS converges roughly five times faster than MU. This effect occurs consistently regardless of random initialization.

In each figure, we measure reconstruction error (loss) by the scaled norm of the residual, ‖X − X̂‖/‖X‖. All results are obtained via Julia [30] code (version 1.0), published in the GitHub repository at github.com/degleris1/CMF.jl, which contains implementations of all algorithms and Jupyter notebooks to reproduce figures. We use the Sherlock compute cluster at Stanford to run all simulations, using two cores (Broadwell) with 16GB of memory per core. In all experiments, all algorithms were given the same random initialization.


Fig. 2. Algorithm performance on synthetic data. The vertical axis denotes normalized loss of the CNMF model, ‖X − X̂‖/‖X‖; the horizontal axis denotes cumulative computation time. As dataset size increases (denoted by the number of timebins T), the performance of HALS and ANLS improves relative to multiplicative updates. For T = 500 and T = 2500, all three algorithms perform similarly. For T = 10000, multiplicative updates takes significantly longer to converge. Finally, for T = 50000, multiplicative updates makes little to no progress in the allotted time (1000 seconds), whereas both ANLS and HALS rapidly converge.

Fig. 3. Algorithm performance on a songbird spectrogram from [10]. ANLS and HALS perform similarly and nearly converge after 20 seconds; multiplicative updates takes approximately three times as long to achieve the same objective value.

A. Synthetic Data

In this experiment, we test each algorithm on synthetic data of various sizes. The synthetic datasets were generated from a CNMF model with added noise, as follows:
• The dimensional parameters were chosen to be N = 250, L = 20, K = 5. We generated and examined four otherwise identical datasets with T = 500, T = 2500, T = 10000, and T = 50000.
• Each w_{:nk} (the length-L fibers of W) followed a randomly shifted Gaussian curve. Specifically, let f(τ; µ_{nk}, σ) denote a univariate Gaussian probability distribution function with mean µ and standard deviation σ. We set σ = 0.2 and sampled µ_{nk} uniformly at random between −1 and 1. We then randomly sampled amplitude parameters from a symmetric Dirichlet distribution, α_n ∼ Dir(0.1), achieving approximately sparse vectors α_n ∈ R_+^K representing loadings across each component. Finally, we set W_{ℓnk} = α_{nk} f(2ℓ/L − 1; µ_{nk}, σ) for ℓ = 1, ..., L. This procedure was repeated for each feature n = 1, ..., N and component k = 1, ..., K.
• Each element in H was set to zero with probability 0.1, and otherwise randomly sampled from an exponential distribution with a rate parameter λ = 1. Similar to our construction of W, this produced a synthetic dataset with sparse factors, in agreement with previously reported results on real data (e.g. [10]).
• The ground truth matrix is given by X^{true} = Σ_{ℓ=1}^L W_ℓ H S_{ℓ−1}. We then added truncated Gaussian noise, X_{nt} = max(0, X^{true}_{nt} + e_{nt}), where each e_{nt} was drawn from a standard normal distribution (zero mean and unit standard deviation). The matrix X was given as input to all algorithms.
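For concreteness, the generation procedure described in the list above can be sketched in Julia as follows, assuming the Distributions.jl package for the Dirichlet, Gaussian, and exponential draws (the function name and the exact way the draws are organized are ours; the published CMF.jl repository may organize this differently).

```julia
using Distributions, LinearAlgebra

function synthetic_cnmf_data(N, L, K, T; σ=0.2)
    # Motifs: randomly shifted Gaussian bumps scaled by sparse Dirichlet loadings
    W = zeros(L, N, K)
    for n in 1:N
        α = rand(Dirichlet(K, 0.1))                          # loadings across components
        for k in 1:K
            μ = 2rand() - 1                                   # uniform on [-1, 1]
            W[:, n, k] = α[k] .* pdf.(Normal(μ, σ), 2 .* (1:L) ./ L .- 1)
        end
    end
    # Events: zero with probability 0.1, otherwise exponential with rate 1
    H = [rand() < 0.1 ? 0.0 : rand(Exponential(1.0)) for _ in 1:K, _ in 1:T]
    # Noisy, truncated observations
    shiftm(s) = diagm(s => ones(T - s))
    Xtrue = sum(W[l, :, :] * H * shiftm(l - 1) for l in 1:L)
    X = max.(0, Xtrue .+ randn(N, T))
    return X, W, H
end

X, Wtrue, Htrue = synthetic_cnmf_data(250, 20, 5, 500)
```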


Fig. 4. Comparison of algorithms on a large speech dataset. Left: normalized loss, ‖X − X̂‖/‖X‖, achieved by each algorithm as a function of computation time. Both HALS and ANLS converge significantly faster than multiplicative updates. Right: a small slice of the speech dataset, representative of the full recording.

Fig. 5. The twenty components W_{::k} recovered by multiplicative updates, HALS, and ANLS. The vertical axis spans frequency (DFT bin) and the horizontal axis spans time. One rectangle surrounded by a colored border is a single W_{::k}. The borders of each component are colored to match the previous loss plots. For all three algorithms, the components are perceptually similar but appear in different orders. Specifically, each algorithm recovers harmonic stacks that correspond to different sounds frequently spoken during the recording.

Convergence on synthetic data is shown in Figure 2. For small dataset sizes, all three algorithms give similar performance. As dataset size grows, however, HALS and ANLS converge much more quickly than MU. This is best illustrated by the dataset with T = 50000 columns. On this large dataset, MU fails to converge within the 1000 second limit.

B. Results on a Songbird Spectrogram

In this experiment, we fit the CNMF model on a songbird spectrogram from [10] (available at github.com/FeeLab/seqNMF). The dimensions of the data matrix are 141 DFT bins (rows) by 4440 timebins (columns), and we use a motif length of L = 50 and K = 3 factors. The timebins are sampled at 200 Hz. We run each algorithm for 60 seconds and plot the relative loss over time. We find that both HALS and ANLS converge after around 20 seconds, whereas multiplicative updates fails to converge within the 60 second time limit (Fig. 3). All algorithms find perceptually similar components (data not shown).

C. Qualitative Results on a Large Speech Dataset

In this experiment, we fit the CNMF model on a large dataset consisting of two males speaking as part of an interview. Following the procedure in [31], we down-sample the audio recording to 8 kHz and compute a magnitude spectrum using an FFT window of 512 samples and an overlap of 384. This yields a data matrix of size 257 × 20149, which we fit using K = 20 components and a motif length of L = 12 time-steps. As a final preprocessing step, we log-transform the spectrogram and add a constant (so that all entries are nonnegative).

We observe that HALS and ANLS converge to their final loss roughly 5x faster than MU. A small section of the magnitude spectrogram, along with a convergence comparison, is shown in Figure 4.
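A rough sketch of this preprocessing is given below, assuming FFTW.jl and an `audio` vector that has already been loaded and resampled to 8 kHz (loading, resampling, windowing, and the exact nonnegativity offset are details on which the authors' pipeline may differ).

```julia
using FFTW

# Log-magnitude spectrogram with a 512-sample FFT window and 384-sample overlap
function log_magnitude_spectrogram(audio; nfft=512, overlap=384)
    hop = nfft - overlap
    nframes = fld(length(audio) - nfft, hop) + 1
    X = zeros(nfft ÷ 2 + 1, nframes)                     # 257 DFT bins per column
    for j in 1:nframes
        frame = audio[(j - 1) * hop + 1 : (j - 1) * hop + nfft]
        X[:, j] = abs.(fft(frame)[1:nfft ÷ 2 + 1])       # magnitude spectrum
    end
    Xlog = log.(X .+ 1e-6)
    return Xlog .- minimum(Xlog)                         # offset so all entries are ≥ 0
end

audio = randn(8_000 * 60)             # placeholder: one minute of fake 8 kHz audio
X = log_magnitude_spectrogram(audio)
```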

A natural question is whether the components found by HALS and ANLS are similar to those found by MU. We find that this is indeed the case. Figure 5 shows that components recovered by all three algorithms are perceptually similar, each containing distinctive horizontal bands which correspond to the harmonics found in human speech. The components extracted in this experiment look similar to those found by [31].

V. CONCLUSION

In this paper, we have shown how to extend two popular algorithms for NMF, HALS and ANLS, to the convolutive NMF problem. Both algorithms offer faster convergence rates than MU, with speedups of around 5x noted on a large dataset, and were observed to recover qualitatively similar motifs. In situations where the practitioner must perform a parameter search over regularization strengths or the number of motifs, this speedup is of practical value. Future research could investigate improvements to the ANLS algorithm by incorporating specialized nonnegative least squares solvers and potentially exploiting the block Toeplitz structure of eq. 23. To handle even larger datasets, randomized variants of the CNMF algorithms described here could also be developed, in analogy to recently proposed randomized variants of HALS in NMF [14]. Overall, we expect these improvements to enable convolutional factor modeling on a variety of high-dimensional time series data with much longer durations than what has been previously explored.

APPENDIX A
REFORMULATIONS OF THE CNMF OBJECTIVE

The classical form of the CNMF approximation is

f(W, H) = \sum_{\ell=1}^{L} W_\ell H S_{\ell-1}. \qquad (24)

We define the convolution operator as

W_{::k}^T * h_k^T = \sum_{\tau=1}^{T} H_{k\tau}\, [W_{::k}^T]_\tau, \qquad (25)

where [W_{::k}^T]_\tau = \begin{bmatrix} 0_{\tau-1} & W_{::k}^T & 0_{T+1-L-\tau} \end{bmatrix} and 0_p is an N × p matrix of zeros. This allows us to write the outer product form of the CNMF approximation,

f(W, H) = \sum_{k=1}^{K} W_{::k}^T * h_k^T. \qquad (26)

Another useful formulation comes from considering Kronecker identities. Given three matrices A, X, B, we know

\mathrm{vec}(AXB) = (B^T \otimes A)\, \mathrm{vec}(X) \qquad (27)

from [26]. This leads us to the Kronecker form of the CNMF approximation, which is

\mathrm{vec}(f(W, H)) = \sum_{\ell=1}^{L} (S_{\ell-1}^T \otimes W_\ell)\, \mathrm{vec}(H) \qquad (28)

\phantom{\mathrm{vec}(f(W, H))} = \sum_{\ell=1}^{L} (S_{1-\ell} \otimes W_\ell)\, \mathrm{vec}(H). \qquad (29)

We define the matrix V = \sum_{\ell=1}^{L} S_{1-\ell} \otimes W_\ell, which is also written as

V = \begin{bmatrix}
W_1 & 0 & \cdots & 0 \\
W_2 & W_1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
W_L & W_{L-1} & \cdots & 0 \\
0 & W_L & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & W_1
\end{bmatrix}, \qquad (30)

where 0 is an N × K matrix of zeros. Thus the Kronecker form is concisely written as vec(f(W, H)) = V vec(H).

Alternatively, we can take the transpose of equation (24) and apply (27) to write the Toeplitz form of the CNMF approximation,

\mathrm{vec}(f(W, H)^T) = \sum_{\ell=1}^{L} (W_\ell \otimes S_{1-\ell})\, \mathrm{vec}(H^T), \qquad (31)

which is also written as

\mathrm{vec}(f(W, H)^T) = \begin{bmatrix}
\mathcal{T}(w_{:11}) & \cdots & \mathcal{T}(w_{:1K}) \\
\vdots & \ddots & \vdots \\
\mathcal{T}(w_{:N1}) & \cdots & \mathcal{T}(w_{:NK})
\end{bmatrix} \mathrm{vec}(H^T), \qquad (32)

where \mathcal{T}(v) ∈ R^{T×T} is a Toeplitz matrix defined for any vector v ∈ R^L as

\mathcal{T}(v) = \begin{bmatrix}
v_1 & 0 & \cdots & 0 \\
v_2 & v_1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
v_L & v_{L-1} & \cdots & 0 \\
0 & v_L & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & v_1
\end{bmatrix}, \qquad (33)

i.e. the ℓth diagonal below the main diagonal is equal to v_{ℓ+1}.
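The Toeplitz form (31)-(33) can be checked numerically row by row. The short Julia sketch below builds each \mathcal{T}(w_{:nk}) from shift matrices (helper names are ours) and verifies that the nth row of the classical reconstruction (24) equals \sum_k \mathcal{T}(w_{:nk}) h_k, which is exactly the nth block row of (32).

```julia
using LinearAlgebra

shift(T, s) = diagm(s => ones(T - s))
# T(v) in (33): v_{ℓ+1} on the ℓ-th diagonal below the main diagonal
toep(v, T) = sum(v[l] .* shift(T, l - 1)' for l in 1:length(v))

L, N, K, T = 3, 4, 2, 8
W, H = rand(L, N, K), rand(K, T)

f = sum(W[l, :, :] * H * shift(T, l - 1) for l in 1:L)       # classical form (24)
for n in 1:N
    @assert f[n, :] ≈ sum(toep(W[:, n, k], T) * H[k, :] for k in 1:K)
end
```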
APPENDIX B
DERIVATIONS OF THE HALS UPDATE RULES

A. HALS for NMF

Consider the NMF objective, written as

\min_{w_1,\dots,w_K,\, h_1,\dots,h_K}\ \Big\| X - \sum_{k=1}^{K} w_k h_k^T \Big\|^2 \quad \text{subject to}\quad w_k \ge 0,\ h_k \ge 0\ \ \forall k \in \{1, 2, \dots, K\}.

We will derive a closed-form update rule that updates a single column w_k or a single row h_k^T. By the symmetry of the problem, it suffices to derive this update rule for w_k only. First choose k and let E = X - \sum_{p \ne k} w_p h_p^T. Our minimization problem is now

\min_{w_k}\ \|E - w_k h_k^T\|^2 \quad \text{subject to}\quad w_k \ge 0.

Applying the identity ‖X‖² = Tr(XᵀX), we can rewrite J(w_k) = ‖E − w_k h_k^T‖² as

J(w_k) = \mathrm{Tr}(E^T E) + \mathrm{Tr}(h_k w_k^T w_k h_k^T) - 2\,\mathrm{Tr}(E^T w_k h_k^T)
       = \mathrm{Tr}(E^T E) + \mathrm{Tr}(h_k^T h_k w_k^T w_k) - 2\,\mathrm{Tr}(h_k^T E^T w_k)
       = \|E\|^2 + \|h_k\|^2 \|w_k\|^2 - 2\, h_k^T E^T w_k.

Next, we write the Lagrangian as

\mathcal{L}(w_k, \lambda) = \mathrm{Tr}(E^T E) + \|h_k\|^2 \|w_k\|^2 - 2\, h_k^T E^T w_k - \lambda^T w_k,

which has gradient

\nabla_{w_k} \mathcal{L}(w_k, \lambda) = 2\|h_k\|^2 w_k - 2 E h_k - \lambda.

Setting this equal to zero gives us the KKT conditions

w_k = \frac{E h_k + \frac{1}{2}\lambda}{\|h_k\|^2} \qquad (34)
\lambda \ge 0 \qquad (35)
w_k \ge 0 \qquad (36)
(w_k)_i \lambda_i = 0 \quad \forall i = 1, 2, \dots, N. \qquad (37)

If (E h_k)_i ≥ 0, then we must have λ_i = 0 to satisfy equation (37). If (E h_k)_i < 0, then we must have (w_k)_i = 0. This leads to the closed form solution

w_k = \max\left(0,\ \frac{E h_k}{\|h_k\|^2}\right). \qquad (38)

B. HALS for CNMF

For the CNMF model, the HALS update rule loses its symmetry across W and H. However, as demonstrated in Section III-A, the update rule for W can be derived using the HALS update rule for NMF. To derive the update rule for H, we begin by choosing k, t and defining E = X - \sum_{(p,\tau) \ne (k,t)} H_{p\tau}\, [W_{::p}^T]_\tau. From (18), we can update H_{kt} with the optimization problem

\min_{H_{kt}}\ J(H_{kt}) = \|E - H_{kt}\, [W_{::k}^T]_t\|^2 \quad \text{subject to}\quad H_{kt} \ge 0. \qquad (39)

Since [W_{::k}^T]_t only interacts with L columns of E, we can define R = E_{:,\ t:t+L-1} and write (39) as

\min_{H_{kt}}\ J(H_{kt}) = \|R - H_{kt}\, W_{::k}^T\|^2 \quad \text{subject to}\quad H_{kt} \ge 0. \qquad (40)

Since H_{kt} is just a scalar, it is quite simple to derive a closed form update rule. The corresponding Lagrangian is

\mathcal{L}(H_{kt}, \lambda) = \|R\|^2 + H_{kt}^2 \|W_{::k}^T\|^2 - 2 H_{kt}\,\mathrm{Tr}(W_{::k} R) - \lambda H_{kt}, \qquad (41)

which has gradient

\nabla_{H_{kt}} \mathcal{L}(H_{kt}, \lambda) = 2 H_{kt} \|W_{::k}^T\|^2 - 2\,\mathrm{Tr}(W_{::k} R) - \lambda. \qquad (42)

This gives us the KKT conditions

H_{kt} = \frac{\mathrm{Tr}(W_{::k} R) + \frac{1}{2}\lambda}{\|W_{::k}^T\|^2} \qquad (43)
H_{kt} \ge 0 \qquad (44)
\lambda \ge 0 \qquad (45)
\lambda H_{kt} = 0. \qquad (46)

Equation (46), which is referred to as the complementary slackness condition, implies that either λ or H_{kt} must be zero. This allows us to update H_{kt} using the closed form update rule from Section III-A.

REFERENCES

[1] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, p. 788, Oct. 1999. DOI: 10.1038/44565.
[2] N. Gillis, "The why and how of nonnegative matrix factorization," Regularization, Optimization, Kernels, and Support Vector Machines, vol. 12, no. 257, 2014.
[3] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2003, pp. 177–180.
[4] S. Jia and Y. Qian, "Constrained nonnegative matrix factorization for hyperspectral unmixing," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 1, pp. 161–173, Jan. 2009. DOI: 10.1109/TGRS.2008.2002882.
[5] J. K. Liu, H. M. Schreyer, A. Onken, F. Rozenblit, M. H. Khani, V. Krishnamoorthy, S. Panzeri, and T. Gollisch, "Inference of neuronal functional circuitry with spike-triggered non-negative matrix factorization," Nature Communications, vol. 8, no. 1, p. 149, 2017. DOI: 10.1038/s41467-017-00156-9.
[6] R. H. R. Hahnloser, A. A. Kozhevnikov, and M. S. Fee, "An ultra-sparse code underlies the generation of neural sequences in a songbird," Nature, vol. 419, no. 6902, pp. 65–70, 2002. DOI: 10.1038/nature00974.
[7] S. Fujisawa, A. Amarasingham, M. T. Harrison, and G. Buzsáki, "Behavior-dependent short-term assembly dynamics in the medial prefrontal cortex," Nature Neuroscience, vol. 11, no. 7, p. 823, 2008.
[8] P. Smaragdis, "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," in Independent Component Analysis and Blind Signal Separation, C. G. Puntonet and A. Prieto, Eds., Berlin, Heidelberg: Springer, 2004, pp. 494–499.
[9] P. Smaragdis et al., "Convolutive speech bases and their application to supervised speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, p. 1, 2007.
[10] E. L. Mackevicius, A. H. Bahle, A. H. Williams, S. Gu, N. I. Denissenko, M. S. Goldman, and M. S. Fee, "Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience," bioRxiv, 2018. DOI: 10.1101/273128.
[11] V. Ramanarayanan, A. Katsamanis, and S. Narayanan, "Automatic data-driven learning of articulatory primitives from real-time MRI data using convolutive NMF with sparseness constraints," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[12] J. Zhou, R. Liang, L. Zhao, L. Tao, and C. Zou, "Unsupervised learning of phonemes of whispered speech in a noisy environment based on convolutive non-negative matrix factorization," Information Sciences, vol. 257, pp. 115–126, 2014.
[13] R. Kannan, G. Ballard, and H. Park, "A high-performance parallel algorithm for nonnegative matrix factorization," SIGPLAN Notices, vol. 51, no. 8, pp. 9:1–9:11, Feb. 2016. DOI: 10.1145/3016078.2851152.
[14] N. B. Erichson, A. Mendible, S. Wihlborn, and J. N. Kutz, "Randomized nonnegative matrix factorization," Pattern Recognition Letters, vol. 104, pp. 1–7, 2018. DOI: 10.1016/j.patrec.2018.01.007.
[15] B. Zupan et al., "Nimfa: A Python library for nonnegative matrix factorization," Journal of Machine Learning Research, vol. 13, no. 3, pp. 849–853, 2012.
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[17] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[18] J. Kim, Y. He, and H. Park, "Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework," Journal of Global Optimization, vol. 58, no. 2, pp. 285–319, Feb. 2014. DOI: 10.1007/s10898-013-0035-4.
[19] S. J. Wright, "Coordinate descent algorithms," Mathematical Programming, vol. 151, no. 1, pp. 3–34, Jun. 2015. DOI: 10.1007/s10107-015-0892-3.
[20] S. Vavasis, "On the complexity of nonnegative matrix factorization," SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2010. DOI: 10.1137/070709967.
[21] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in Advances in Neural Information Processing Systems, 2004, pp. 1141–1148.
[22] S. Arora, R. Ge, R. Kannan, and A. Moitra, "Computing a nonnegative matrix factorization – provably," SIAM Journal on Computing, vol. 45, no. 4, pp. 1582–1611, 2016. DOI: 10.1137/130913869.
[23] A. Cichocki, R. Zdunek, and S.-i. Amari, "Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization," in International Conference on Independent Component Analysis and Signal Separation, Springer, 2007, pp. 169–176.
[24] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
[25] R. A. Polyak, "Projected gradient method for non-negative least square," Contemporary Mathematics, vol. 636, pp. 167–179, 2015.
[26] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis. Cambridge University Press, 1991. DOI: 10.1017/CBO9780511840371.
[27] J. T. Vogelstein, A. M. Packer, T. A. Machado, T. Sippy, B. Babadi, R. Yuste, and L. Paninski, "Fast nonnegative deconvolution for spike train inference from population calcium imaging," Journal of Neurophysiology, vol. 104, no. 6, pp. 3691–3704, 2010.
[28] A. Beck and L. Tetruashvili, "On the convergence of block coordinate descent type methods," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2037–2060, 2013.
[29] J. Kim and H. Park, "Fast nonnegative matrix factorization: An active-set-like method and comparisons," SIAM Journal on Scientific Computing, vol. 33, no. 6, pp. 3261–3281, 2011.
[30] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, "Julia: A fresh approach to numerical computing," SIAM Review, vol. 59, no. 1, pp. 65–98, 2017.
[31] P. D. O'Grady and B. A. Pearlmutter, "Discovering convolutive speech phones using sparseness and non-negativity," in International Conference on Independent Component Analysis and Signal Separation, Springer, 2007, pp. 520–527.
Zupan et al., “Nimfa: A python library for nonnega- [29] J. Kim and H. Park, “Fast nonnegative matrix factor- tive matrix factorization,” Journal of Machine Learning ization: An active-set-like method and comparisons,” Research, vol. 13, no. 3, pp. 849–853, 2012. SIAM Journal on Scientific Computing, vol. 33, no. 6, [16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, pp. 3261–3281, 2011. B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, [30] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. “Julia: A fresh approach to numerical computing,” Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, SIAM review, vol. 59, no. 1, pp. 65–98, 2017. “Scikit-learn: Machine learning in Python,” Journal of [31] P. D. O’grady and B. A. Pearlmutter, “Discovering Machine Learning Research, vol. 12, pp. 2825–2830, convolutive speech phones using sparseness and non- 2011. negativity,” in International Conference on Independent [17] D. D. Lee and H. S. Seung, “Algorithms for non- Component Analysis and Signal Separation, Springer, negative matrix factorization,” in Advances in neural 2007, pp. 520–527. information processing systems, 2001, pp. 556–562. [18] J. Kim, Y. He, and H. Park, “Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework,” Journal of Global Optimization, vol. 58, no. 2, pp. 285–319, Feb. 2014, ISSN: 1573-2916. DOI: 10 . 1007 / s10898 - 013 - 0035-4. [Online]. Available: https://doi.org/10.1007/ s10898-013-0035-4. [19] S. J. Wright, “Coordinate descent algorithms,” Mathe- matical Programming, vol. 151, no. 1, pp. 3–34, Jun.