Convex nonnegative matrix factorization with missing data
Ronan Hamon, Valentin Emiya, Cédric Févotte

To cite this version:

Ronan Hamon, Valentin Emiya, Cédric Févotte. Convex nonnegative matrix factorization with missing data. IEEE International Workshop on Machine Learning for Signal Processing, Sep 2016, Vietri sul Mare, Salerno, Italy. hal-01346492

HAL Id: hal-01346492 https://hal-amu.archives-ouvertes.fr/hal-01346492 Submitted on 7 Oct 2016


CONVEX NONNEGATIVE MATRIX FACTORIZATION WITH MISSING DATA

Ronan Hamon, Valentin Emiya*, Cédric Févotte

Aix Marseille Univ, CNRS, LIF, Marseille, France
CNRS & IRIT, Toulouse, France

ABSTRACT

Convex nonnegative matrix factorization (CNMF) is a variant of nonnegative matrix factorization (NMF) in which the components are a convex combination of atoms of a known dictionary. In this contribution, we propose to extend CNMF to the case where the data matrix and the dictionary have missing entries. After a formulation of the problem in this context of missing data, we propose a majorization-minimization algorithm to solve the resulting optimization problem. Experimental results with synthetic data and audio spectrograms highlight an improvement of the reconstruction performance with respect to standard NMF. The performance gap is particularly significant when the reconstruction task becomes arduous, e.g. when the ratio of missing data is high, the noise is strong, or the data are complex.

Index Terms: matrix factorization, nonnegativity, low-rankness, matrix completion, spectrogram inpainting

1. INTRODUCTION

Convex NMF (CNMF) [1] is a special case of nonnegative matrix factorization (NMF) [2], in which the matrix of components is constrained to be a linear combination of atoms of a known dictionary. The term "convex" refers to the constraint on the linear combination, where the combination coefficients forming each component are nonnegative and sum to 1. Compared to the fully unsupervised NMF setting, the use of known atoms is a source of supervision that may guide learning based on this additional data: in particular, an interesting case of CNMF consists in auto-encoding the data themselves, by defining the atoms as the data matrix. CNMF has been of interest in a number of contexts, such as clustering, face recognition, or music transcription [1, 3]. It is also related to the self-expressive dictionary-based representation proposed in [4].

An issue that has not yet been addressed is when the data matrix has missing coefficients. Such an extension of CNMF is worth being considered, as it opens the way to data-reconstruction settings with nonnegative low-rank constraints, which covers several relevant applications. One example concerns the field of image or audio inpainting [5, 6, 7, 8], where CNMF may improve the current reconstruction techniques. In inpainting of audio spectrograms for example, setting up the dictionary to be a comprehensive collection of notes from a specific instrument may guide the factorization toward a realistic and meaningful decomposition, increasing the quality of the reconstruction of the missing data. In this contribution, we also consider the case where the dictionary itself may have missing coefficients.

The paper is organized as follows. Section 2 formulates CNMF in the presence of missing entries in the data matrix and in the dictionary. Section 3 describes the proposed majorization-minimization (MM) algorithm. Sections 4 and 5 report experimental results with synthetic data and audio spectrograms.

*This work was supported by ANR JCJC program MAD (ANR-14-CE27-0002).

2. CONVEX NONNEGATIVE MATRIX FACTORIZATION WITH MISSING DATA

2.1. Notations and definitions

For any integer N, the integer set {1, 2, ..., N} is denoted by [N]. The coefficients of a matrix A ∈ R^{M×N} are denoted by either a_{mn} or [A]_{mn}. The element-wise matrix product, matrix division and matrix power are denoted by A.B, A/B and A^{.γ}, respectively, where A and B are matrices with the same dimensions and γ is a scalar. 0 and 1 denote vectors or matrices composed of zeros and ones, respectively, with dimensions that can be deduced from the context. The element-wise negation of a binary matrix M is denoted by M̄ ≜ 1 − M.

2.2. NMF and Convex NMF

NMF consists in approximating a data matrix V ∈ R_+^{F×N} as the product WH of two nonnegative matrices W ∈ R_+^{F×K} and H ∈ R_+^{K×N}. Often, K < min(F, N), such that WH is a low-rank approximation of V. Every sample v_n, the n-th column of V, is thus decomposed as a linear combination of K elementary components or patterns w_1, ..., w_K ∈ R_+^F, the columns of W. The coefficients of the linear combination are given by the n-th column h_n of H.

In [9] and [10], algorithms have been proposed for the unsupervised estimation of W and H from V, by minimization of the cost function D_β(V|WH) = Σ_{f,n} d_β(v_{fn} | [WH]_{fn}), where d_β(x|y) is the β-divergence defined as:

    d_β(x|y) ≜  (1/(β(β−1))) [x^β + (β−1) y^β − β x y^{β−1}]   for β ∈ R \ {0, 1}
                x log(x/y) − x + y                              for β = 1
                x/y − log(x/y) − 1                              for β = 0        (1)

When ill-defined, we set by convention d_β(0|0) = 0.
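For readers who want to experiment with these costs, here is a minimal NumPy sketch of the element-wise β-divergence of equation (1); the function name and the handling of edge cases are our own choices, not part of the paper.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Element-wise beta-divergence d_beta(x|y) of equation (1).

    x and y are nonnegative arrays of the same shape; y is assumed positive.
    The convention d_beta(0|0) = 0 is not handled here and would require
    special-casing the zero entries.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 1:    # generalized Kullback-Leibler divergence
        return x * np.log(x / y) - x + y
    if beta == 0:    # Itakura-Saito divergence
        return x / y - np.log(x / y) - 1.0
    # general case, beta not in {0, 1}
    return (x**beta + (beta - 1.0) * y**beta
            - beta * x * y**(beta - 1.0)) / (beta * (beta - 1.0))
```

For β = 2 this reduces to the squared Euclidean cost (x − y)²/2, the setting used in the synthetic experiments of Section 4; β = 1 is used for the audio spectrograms of Section 5.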

CNMF is a variant of NMF in which W = SL. S = [s_1, ..., s_P] ∈ R_+^{F×P} is a nonnegative matrix of atoms and L = [l_1, ..., l_K] ∈ R_+^{P×K} is the so-called labeling matrix. Each dictionary element w_k is thus equal to S l_k, with usually P >> K, and the data is in the end decomposed as V = SLH. The scale indeterminacy between L and H may be lifted by imposing ||l_k||_1 = 1, in which case w_k is precisely a convex combination of the elements of the subspace S. CNMF can be related to the so-called archetypal analysis [11], but without considering any nonnegativity constraint. The use of known examples in S can then be seen as a source of supervision that guides learning. A special case of CNMF is obtained by setting S = V, thus auto-encoding the data as VLH. This particular case is studied in depth in [1]. In this paper, we consider the general case for S, with or without missing data.

2.3. Convex NMF with missing data

We assume that some coefficients in V and S may be missing. Let 𝒱 ⊂ [F] × [N] be a set of pairs of indices that locates the observed coefficients in V: (f, n) ∈ 𝒱 iff v_{fn} is known. Similarly, let 𝒮 ⊂ [F] × [P] be a set of pairs of indices that locates the observed coefficients in S. The use of the sets 𝒱 and 𝒮 may be reformulated equivalently by defining masking matrices M_V ∈ {0,1}^{F×N} and M_S ∈ {0,1}^{F×P} from 𝒱 and 𝒮 as

    [M_V]_{fn} ≜ 1 if (f, n) ∈ 𝒱, 0 otherwise,   ∀(f, n) ∈ [F] × [N]        (2)
    [M_S]_{fp} ≜ 1 if (f, p) ∈ 𝒮, 0 otherwise,   ∀(f, p) ∈ [F] × [P]        (3)

A major goal in this paper is to estimate L, H and the missing entries in S, given the partially observed data matrix V. Denoting by S^o the set of observed/known dictionary matrix coefficients, our aim is to minimize the objective function

    C(S, L, H) ≜ D_β(M_V.V | M_V.SLH)        (4)

subject to S ∈ R_+^{F×P}, L ∈ R_+^{P×K}, H ∈ R_+^{K×N}, and M_S.S = M_S.S^o. The particular case where the dictionary equals the data matrix itself is obtained by setting (M_S, S^o) ≜ (M_V, V).
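To make the notation concrete, the sketch below builds the masks of equations (2)-(3) from index sets and evaluates the masked objective (4); the helper names are hypothetical and `beta_divergence` refers to the sketch given earlier.

```python
import numpy as np

def mask_from_indices(observed_indices, shape):
    """Binary mask M with [M]_{fn} = 1 iff (f, n) is observed (equations (2)-(3))."""
    M = np.zeros(shape)
    for f, n in observed_indices:
        M[f, n] = 1.0
    return M

def masked_cost(V, S, L, H, M_V, beta):
    """Objective (4): beta-divergence restricted to the observed entries of V."""
    V_hat = S @ L @ H
    # Replace unobserved entries by a dummy value so the divergence stays finite,
    # then zero them out with the mask; this matches the convention d_beta(0|0) = 0.
    d = beta_divergence(np.where(M_V > 0, V, 1.0),
                        np.where(M_V > 0, V_hat, 1.0), beta)
    return float(np.sum(M_V * d))
```

The constraint M_S.S = M_S.S^o is not encoded here; in Algorithm 1 it is enforced by leaving the observed entries of S untouched during the updates.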

3. PROPOSED ALGORITHM

3.1. General description of the algorithm

Algorithm 1 extends the algorithm proposed in [9] for complete CNMF with the β-divergence to the case of missing entries in V or S. The algorithm is a block-coordinate descent procedure in which each block is one of the three matrix factors. The update of each block/factor is obtained via majorization-minimization (MM), a classic procedure that consists in iteratively minimizing a tight upper bound (called auxiliary function) of the objective function. In the present setting, the MM procedure leads to multiplicative updates, characteristic of many NMF algorithms, that automatically preserve nonnegativity given positive initialization.

Algorithm 1 CNMF with missing data

    Require: V, S^o, M_V, M_S, β
    Initialize S, L, H with random nonnegative values
    loop
        Update S:
            S ← M_S.S + M̄_S.S.( [(M_V.(SLH)^{.(β−2)}.V)(LH)^T] / [(M_V.(SLH)^{.(β−1)})(LH)^T] )^{.γ(β)}        (5)
        Update L:
            L ← L.( [S^T (M_V.(SLH)^{.(β−2)}.V) H^T] / [S^T (M_V.(SLH)^{.(β−1)}) H^T] )^{.γ(β)}        (6)
        Update H:
            H ← H.( [(SL)^T (M_V.(SLH)^{.(β−2)}.V)] / [(SL)^T (M_V.(SLH)^{.(β−1)})] )^{.γ(β)}        (7)
        Rescale L and H:
            ∀k ∈ [K], h_k ← ||l_k||_1 × h_k        (8)
                      l_k ← l_k / ||l_k||_1        (9)
    end loop
    return S, L, H

3.2. Detailed updates

We consider the optimization of C(S, L, H) with respect to each of its three arguments individually, using MM. Current updates are denoted with a tilde, i.e., S̃, L̃ and H̃. We start by recalling the definition of an auxiliary function:

Definition 1 (Auxiliary function). The mapping G(A|Ã): R_+^{I×J} × R_+^{I×J} ↦ R_+ is an auxiliary function to C(A) iff

    ∀A ∈ R_+^{I×J}, C(A) = G(A|A)
    ∀A, Ã ∈ R_+^{I×J}, C(A) ≤ G(A|Ã)        (10)

The iterative minimization of G(A|Ã) with respect to A, with replacement of Ã at every iteration, monotonically decreases the objective C(A) until convergence. As explained in detail in [9], the β-divergence may be decomposed into the sum of a convex term d̂_β(.|.), a concave term ď_β(.|.) and a constant term cst. The first two terms can be majorized using routine Jensen and tangent inequalities, respectively, leading to tractable updates. The auxiliary functions used to derive Algorithm 1 are given by the three following propositions¹ and the monotonicity of the algorithm follows by construction.

¹The proofs of these propositions are available in the extended version at https://hal-amu.archives-ouvertes.fr/hal-01346492.

Proposition 1 (Auxiliary function for S). Let S̃ ∈ R_+^{F×P} be such that ∀(f, n) ∈ [F] × [N], ṽ_{fn} > 0, and ∀(f, p) ∈ [F] × [P], s̃_{fp} > 0, where Ṽ ≜ S̃LH. Then the function

    G_S(S|S̃) ≜ Σ_{f,n} [M_V]_{fn} [ Ĝ_{fn}(S|S̃) + Ǧ_{fn}(S|S̃) ] + cst

    where  Ĝ_{fn}(S|S̃) ≜ Σ_p ([LH]_{pn} s̃_{fp} / ṽ_{fn}) d̂_β(v_{fn} | ṽ_{fn} s_{fp} / s̃_{fp})
           Ǧ_{fn}(S|S̃) ≜ ď_β(v_{fn}|ṽ_{fn}) + ď'_β(v_{fn}|ṽ_{fn}) Σ_p [LH]_{pn} (s_{fp} − s̃_{fp})

is an auxiliary function to C(S, L, H) with respect to S and its minimum is given by equation (5). The auxiliary function decouples with respect to the individual coefficients of S and, as such, the constraint M_S.S = M_S.S^o is directly imposed by only updating the coefficients of S with indices outside 𝒮.

Proposition 2 (Auxiliary function for L). Let L̃ ∈ R_+^{P×K} be such that ∀(f, n) ∈ [F] × [N], ṽ_{fn} > 0 and ∀(p, k) ∈ [P] × [K], l̃_{pk} > 0, where Ṽ ≜ SL̃H. Then the function

    G_L(L|L̃) ≜ Σ_{f,n} [M_V]_{fn} [ Ĝ_{fn}(L|L̃) + Ǧ_{fn}(L|L̃) ] + cst

    where  Ĝ_{fn}(L|L̃) ≜ Σ_{p,k} (s_{fp} l̃_{pk} h_{kn} / ṽ_{fn}) d̂_β(v_{fn} | ṽ_{fn} l_{pk} / l̃_{pk})
           Ǧ_{fn}(L|L̃) ≜ ď_β(v_{fn}|ṽ_{fn}) + ď'_β(v_{fn}|ṽ_{fn}) Σ_{p,k} s_{fp} h_{kn} (l_{pk} − l̃_{pk})

is an auxiliary function to C(S, L, H) with respect to L and its minimum, subject to M_S.S = M_S.S^o for M_S ∈ {0,1}^{F×P} and S^o ∈ R_+^{F×P}, is given by equation (6).

Proposition 3 (Auxiliary function for H). Let us define W ≜ SL and let H̃ ∈ R_+^{K×N} be such that ∀(f, n) ∈ [F] × [N], ṽ_{fn} > 0 and ∀(k, n) ∈ [K] × [N], h̃_{kn} > 0, where Ṽ ≜ WH̃. Then the function

    G_H(H|H̃) ≜ Σ_{f,n} [M_V]_{fn} [ Ĝ_{fn}(H|H̃) + Ǧ_{fn}(H|H̃) ] + cst

    where  Ĝ_{fn}(H|H̃) ≜ Σ_k (w_{fk} h̃_{kn} / ṽ_{fn}) d̂_β(v_{fn} | ṽ_{fn} h_{kn} / h̃_{kn})
           Ǧ_{fn}(H|H̃) ≜ ď_β(v_{fn}|ṽ_{fn}) + ď'_β(v_{fn}|ṽ_{fn}) Σ_k w_{fk} (h_{kn} − h̃_{kn})

is an auxiliary function to C(S, L, H) with respect to H and its minimum is given by equation (7).
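Putting the three updates and the rescaling step together, the following NumPy sketch illustrates Algorithm 1. It is our own illustration, with a hypothetical function name, a small constant to avoid divisions by zero, and the exponent γ(β) set to the choice used for the MM updates in [9].

```python
import numpy as np

def gamma(beta):
    # Exponent of the multiplicative updates (MM choice of [9]).
    if beta < 1:
        return 1.0 / (2.0 - beta)
    if beta > 2:
        return 1.0 / (beta - 1.0)
    return 1.0

def cnmf_missing(V, S0, M_V, M_S, K, beta=2.0, n_iter=200, eps=1e-12, seed=0):
    """Sketch of Algorithm 1: CNMF with missing data in V and S."""
    F, N = V.shape
    P = S0.shape[1]
    rng = np.random.default_rng(seed)
    # Observed entries of S are taken from S0 and kept fixed; the rest is random.
    S = np.where(M_S > 0, S0, rng.random((F, P)))
    L = rng.random((P, K))
    H = rng.random((K, N))
    g = gamma(beta)
    for _ in range(n_iter):
        # Update S (eq. 5): multiplicative step on the unobserved entries only
        V_hat = S @ L @ H
        num = (M_V * V_hat**(beta - 2) * V) @ (L @ H).T
        den = (M_V * V_hat**(beta - 1)) @ (L @ H).T + eps
        S = M_S * S + (1 - M_S) * (S * (num / den)**g)
        # Update L (eq. 6)
        V_hat = S @ L @ H
        num = S.T @ (M_V * V_hat**(beta - 2) * V) @ H.T
        den = S.T @ (M_V * V_hat**(beta - 1)) @ H.T + eps
        L = L * (num / den)**g
        # Update H (eq. 7)
        W = S @ L
        V_hat = W @ H
        num = W.T @ (M_V * V_hat**(beta - 2) * V)
        den = W.T @ (M_V * V_hat**(beta - 1)) + eps
        H = H * (num / den)**g
        # Rescale so that each column of L sums to one (eqs. 8-9)
        scale = L.sum(axis=0) + eps
        H = H * scale[:, None]
        L = L / scale[None, :]
    return S, L, H
```

With M_S = 0 (nothing observed in S) the iteration reduces to unconstrained CNMF, while setting (M_S, S^o) = (M_V, V) gives the auto-encoding case discussed in Section 2.3.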
4. EXPERIMENT ON SYNTHETIC DATA

4.1. Experimental setting

The objective of this experiment is to analyze the performance of CNMF for reconstructing missing data, by comparing it with the regular NMF. We consider a data matrix V* of rank K* synthesized under the CNMF model V* = S*L*H*, where the matrix of atoms S* and the ground truth factors L* and H* are generated as the absolute values of Gaussian noise. It is worth noting that V* is also consistent with a NMF model by defining W* = S*L*. A perturbed data matrix V is obtained by considering a multiplicative noise, obtained using a Gamma distribution with mean 1 and variance 1/α. Hence the parameter α controls the importance of the perturbation. The mask M_V of known elements in V is derived by considering missing coefficients randomly and uniformly distributed over the matrix, such that the ratio of missing values is equal to σ_V. Generation of the data is repeated 3 times, as well as the generation of the masks. Results are averaged over these repetitions.

From a matrix V with missing entries, NMF and CNMF with missing values are applied using K components. Only the case where β = 2 has been considered in this experiment. In both algorithms, convergence is reached when the relative difference of the cost function between two iterations is below 10^{-5}. 3 repetitions are performed using different random initializations, and the best instance (i.e., the instance which minimizes the cost function) is retained. The reconstructed data matrix is obtained as V̂ = SLH.
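A possible way to generate one such synthetic instance is sketched below (our own code; the dimensions and the example values of α and σ_V follow those reported later in this section).

```python
import numpy as np

rng = np.random.default_rng(42)
F, N, P, K_star = 100, 250, 50, 10
alpha, sigma_V = 100.0, 0.6           # noise level and ratio of missing entries

# Ground-truth CNMF model V* = S* L* H*, factors drawn as |Gaussian noise|
S_star = np.abs(rng.standard_normal((F, P)))
L_star = np.abs(rng.standard_normal((P, K_star)))
H_star = np.abs(rng.standard_normal((K_star, N)))
V_star = S_star @ L_star @ H_star

# Multiplicative Gamma noise with mean 1 and variance 1/alpha
V = V_star * rng.gamma(shape=alpha, scale=1.0 / alpha, size=V_star.shape)

# Mask of known entries: missing coefficients uniformly at random
M_V = (rng.random((F, N)) >= sigma_V).astype(float)
```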

The reconstruction error is obtained by computing the β-divergence between the noiseless data matrix V* and the reconstructed matrix V̂; the error is computed on and averaged over the missing coefficients only, as

    e_test = (1 / Σ_{i,j} [M̄_V]_{ij}) D_β(M̄_V.V* | M̄_V.V̂)        (11)

where M̄_V is the mask of unknown elements in V. In the case of CNMF, we consider two choices for S: the data matrix V with missing values, and the ground truth matrix of atoms S*, considered here without missing values.

The following parameters are fixed: F = 100, N = 250, P = 50, K* = 10, β = 2. We particularly investigate the influence of four factors: the number of estimated components K ∈ [2, 14]; the ratio of missing data σ_V ∈ [0.1, 0.9] in V, i.e., of zeros in M_V; the choice of the matrix S ∈ {S*, V} for the CNMF; the noise level α ∈ [10, 5000] in V (α is inversely proportional to the variance of the noise).
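Under the same conventions as the earlier sketches, the test error of equation (11) can be computed as follows (our own helper; `beta_divergence` is the sketch from Section 2.2).

```python
import numpy as np

def test_error(V_star, V_hat, M_V, beta):
    """Equation (11): beta-divergence averaged over the missing entries only."""
    M_bar = 1.0 - M_V                              # mask of the unknown entries
    d = beta_divergence(np.where(M_bar > 0, V_star, 1.0),
                        np.where(M_bar > 0, V_hat, 1.0), beta)
    return float(np.sum(M_bar * d) / np.sum(M_bar))
```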

4.2. Results

We first focus on the influence of the number of estimated components K for the case where the true dictionary is fully known. Figure 1 displays the test error with respect to the number of estimated components K, for two levels of noise. The reconstruction performance obtained by NMF and by CNMF with S = S* is plotted for different values of the ratio of missing values in V.

Fig. 1. Test error vs. number of components. Two levels of noise are displayed: high noise level (α = 100, left) and low noise level (α = 3000, right).

These results show that the noise level has a high influence on the best number of estimated components. As expected, a high noise requires a strong regularization, obtained here by selecting a low value of K. On the contrary, when the noise is low, the best choice of K is closer to the true value K* = 10. Similarly, when the number of missing data is low (σ_V = 0.2), one should set K to K* = 10, either for CNMF or for NMF. When it gets higher, the optimal K gets lower in order to limit the effect of overfitting. In this case, the best number of components drops down to K = 2, either for CNMF or NMF, the former still performing better than the latter. These first results outline the difference between NMF and CNMF, emphasized in the next figures.

This comparison is augmented by considering the case where S = V, with respect to the ratio σ_V of missing values in V. Figure 2 displays the performance of NMF and CNMF as a function of σ_V, for two noise levels. CNMF with the true atoms (S = S*) gives the best results over the full range of missing data ratios. When there are very few missing data and a low noise, NMF performs almost as well as CNMF. However, the NMF error increases much faster than the CNMF error as the number of missing data grows, or as the noise in the data becomes important. This higher sensitivity of NMF to missing data may be explained by overfitting, since the number of free parameters in NMF is higher than in CNMF. In the case of CNMF with S = V, the model cannot fit the data as well as CNMF with S = S* or as NMF. Consequently, the resulting modeling error is observed when there are few missing data, and when comparing S = V and S = S* on all values. However, it performs better than NMF at high values of σ_V since the constraint S = V can be seen as a regularization.

Fig. 2. Test error vs. ratio of missing data in V, with K = K*. Two levels of noise are displayed: high noise level (α = 100, left) and low noise level (α = 3000, right).

We finally investigate the robustness of the methods by looking at the influence of the multiplicative noise, controlled by the parameter α, on the performance. Figure 3 shows the test error for NMF and for CNMF with S = S* with respect to α and for several values of σ_V. As expected, the test error decreases with the variance of the noise, which is inversely proportional to α. While a low value of α abruptly degrades the reconstruction performance (α < 10³), the test error decreases only slightly for α > 10³. When the variance of the noise is close to zero (α = 5000), the performances of NMF and CNMF are almost the same. The performance of reconstruction differs when the variance of the noise increases, as well as the ratio of missing values in V.

Fig. 3. Test error vs. noise level α. Each curve describes the performance of reconstruction of missing data in V, with K = K*, according to the method and the ratio of missing data σ_V.

5. APPLICATION TO SPECTROGRAM INPAINTING

In order to illustrate the performance of the proposed algorithm on real data, we consider spectrograms of piano music signals, which are known to be well modeled by NMF methods [12]. Indeed, NMF may provide a note-level decomposition of a full music piece, each NMF component being the estimated spectrum and activation of a single note. This approximation has proved successful but is also limited in terms of modelling error and of lack of supervision to guide the NMF algorithm. In such conditions, we have designed an experiment with missing data to compare regular NMF against two CNMF variants: in the first one, we set S = V; in the second one, S contains examples of all possible isolated note spectra from another piano.

5.1. Experimental setting

We consider 17 piano pieces from the MAPS dataset [13]. For each recording, the magnitude spectrogram is computed using a 46-ms sine window with 50%-overlap and 4096 frequency bins, the sampling frequency being 44.1 kHz. Matrix V is created from the resulting spectrogram by selecting the F = 500 lower-frequency bins and the first five seconds, i.e., the N = 214 first time frames. Missing data in V are artificially created by removing coefficients uniformly at random.

Three systems are compared, based on the test error defined as the β-divergence computed on the estimation of the missing data. In all of them, the number of components K is set to the true number of notes available from the dataset annotation and we set β = 1. The first system is the regular NMF, randomly initialized. The second system is the proposed CNMF with (M_S, S^o) ≜ (M_V, V). The third system is the proposed CNMF with S = D, where D is a specific matrix of P = 61 atoms. Each atom is a single-note spectrum extracted from the recording of another piano instrument from the MAPS dataset, from C3 to C8².

²The code of the experiments is available on the webpage of the MAD project http://mad.lif.univ-mrs.fr/.
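As an illustration of this preprocessing, the sketch below computes such a spectrogram with librosa; the file name is hypothetical and a Hann window is used in place of the sine window mentioned above.

```python
import numpy as np
import librosa

y, sr = librosa.load("piano_piece.wav", sr=44100)        # hypothetical recording
win_length = int(0.046 * sr)                              # 46-ms analysis window
hop_length = win_length // 2                              # 50% overlap
X = librosa.stft(y, n_fft=4096, win_length=win_length,
                 hop_length=hop_length, window="hann")
V_full = np.abs(X)                                        # magnitude spectrogram

F, duration = 500, 5.0        # keep the 500 lower-frequency bins and the first 5 s
n_frames = int(duration * sr / hop_length)
V = V_full[:F, :n_frames]

# Artificially remove coefficients uniformly at random
rng = np.random.default_rng(0)
M_V = (rng.random(V.shape) >= 0.8).astype(float)          # 80% missing entries
```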

5.2. Results

Figure 4 displays the test error with respect to the ratio of missing data in V, averaged over the 17 piano pieces. It clearly shows that CNMF with the specific dictionary S = D is much more robust to missing data than the other two systems. When less than 40% of the data are missing, NMF performs slightly better; however, the NMF test error dramatically increases when more data are missing, by a factor 5·10³ when more than 80% of the data are missing. This must be due to overfitting, since NMF has a large number of free parameters to be estimated from very few observations when data are missing. The performance of the CNMF system with S = V probably suffers from modelling error when very few data are missing, since the columns of V may not be able to combine into K components in a convex way. In the range from 50% to 70%, its performance is similar to that of NMF. Beyond this range, it seems to be less prone to overfitting than NMF, probably due to fewer free parameters or to a regularization effect provided by S = V.

Fig. 4. Test error vs. missing data ratio in audio spectrograms.

We now investigate the influence of the "complexity" of the audio signal on the test error when the ratio of missing data is set to the high value of 80%. Figure 5 displays, for each music recording, the test error with respect to the number of different pitches for all notes in the piano piece, which also equals the number of components K used by each system. CNMF with S = D performs better than NMF whatever the number of notes, and its error increases by a small factor along the represented range. NMF performs about five times worse for "easy" pieces, i.e., pieces composed of notes with about 6 different pitches, and it performs about 10⁴ times worse when the number of pitches is larger than 25. CNMF with S = V performs slightly better than NMF. Since the number of components K increases with the number of note pitches, those results confirm that NMF may highly suffer from overtraining while CNMF may not, being robust to missing data even for large values of K.

Fig. 5. Test error vs. number of different note pitches for 80% missing data (each dot represents one piece of music).

6. CONCLUSION

In this paper, we have proposed an extension of convex nonnegative matrix factorization to the case of missing data, as it has been previously presented for regular NMF. The proposed method can deal with missing values both in the data matrix V and in the dictionary S, which is particularly useful in the case S = V where the data is auto-encoded. In this framework, an algorithm has been provided and analyzed using a majorization-minimization (MM) scheme to guarantee the convergence to a local minimum. A large set of experiments on synthetic data showed promising results for this variant of NMF for the task of reconstruction of missing data, and validated the value of this approach. In many situations, CNMF outperforms NMF, especially when the ratio of missing values is high and when the data matrix V is noisy. This trend has been confirmed on real audio spectrograms of piano music. In particular, we have shown how the use of a generic set of isolated piano notes as atoms could dramatically enhance the robustness to missing data.

This preliminary study indicates that the approach is worthy of further investigation, beyond the proposed settings where missing values are uniformly distributed over the matrix. Furthermore, the influence of missing values in the dictionary has not been completely assessed, as only the case where S = V has been taken into account. On the application side, this approach could give new insight into many problems dealing with the estimation of missing data.

7. REFERENCES

[1] C. Ding, T. Li, and M. I. Jordan, "Convex and semi-nonnegative matrix factorizations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 45–55, Jan. 2010.

[2] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[3] E. Vincent, N. Bertin, and R. Badeau, "Adaptive harmonic spectral decomposition for multiple pitch estimation," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 3, pp. 528–537, Mar. 2010.

[4] E. Elhamifar and R. Vidal, "Sparse subspace clustering: Algorithm, theory, and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2765–2781, 2013.

[5] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proc. of SIGGRAPH. ACM, 2000, pp. 417–424.

[6] P. Smaragdis, B. Raj, and M. Shashanka, "Missing data imputation for spectral audio signals," in Proc. of MLSP, Grenoble, France, Sept. 2009.

[7] J. Le Roux, H. Kameoka, N. Ono, A. de Cheveigné, and S. Sagayama, "Computational auditory induction as a missing-data model-fitting problem with Bregman divergence," Speech Communication, vol. 53, no. 5, pp. 658–676, 2011.

[8] D. L. Sun and R. Mazumder, "Non-negative matrix completion for bandwidth extension: A convex optimization approach," in Proc. of MLSP, Sept. 2013, pp. 1–6.

[9] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the β-divergence," Neural Comput., vol. 23, no. 9, pp. 2421–2456, 2011.

[10] M. Nakano, H. Kameoka, J. Le Roux, Y. Kitano, N. Ono, and S. Sagayama, "Convergence-guaranteed multiplicative algorithms for nonnegative matrix factorization with β-divergence," in Proc. of MLSP, 2010.

[11] A. Cutler and L. Breiman, "Archetypal analysis," Technometrics, vol. 36, no. 4, pp. 338–347, 1994.

[12] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Comput., vol. 21, no. 3, pp. 793–830, 2009.

[13] V. Emiya, R. Badeau, and B. David, "Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 6, pp. 1643–1654, 2010.

A. PROOFS

We detail the proofs of Propositions 1 and 2. The proof of Proposition 3 is straightforward using the same methodology. We first recall preliminary elements from [9]. The β-divergence d_β(x|y) can be decomposed as a sum of a convex term, a concave term and a constant term with respect to its second variable y as

    d_β(x|y) = d̂_β(x|y) + ď_β(x|y) + d̄_β(x)        (12)

This decomposition is not unique. We will use the decomposition given in [9, Table 1], for which we have the following derivatives w.r.t. the variable y:

    d̂'_β(x|y) ≜  −x y^{β−2}        if β < 1
                  y^{β−2} (y − x)   if 1 ≤ β ≤ 2
                  y^{β−1}           if β > 2        (13)

    ď'_β(x|y) ≜  y^{β−1}            if β < 1
                  0                 if 1 ≤ β ≤ 2
                  −x y^{β−2}        if β > 2        (14)
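A direct NumPy transcription of the derivatives (13)-(14) (our own sketch, with hypothetical function names) reads:

```python
import numpy as np

def d_hat_prime(x, y, beta):
    """Derivative w.r.t. y of the convex part of d_beta, equation (13)."""
    if beta < 1:
        return -x * y**(beta - 2)
    if beta > 2:
        return y**(beta - 1)
    return y**(beta - 2) * (y - x)          # 1 <= beta <= 2

def d_check_prime(x, y, beta):
    """Derivative w.r.t. y of the concave part of d_beta, equation (14)."""
    if beta < 1:
        return y**(beta - 1)
    if beta > 2:
        return -x * y**(beta - 2)
    return np.zeros_like(np.asarray(y, dtype=float))   # 1 <= beta <= 2
```

Setting the sum of these two derivatives to zero inside the masked sums of equations (23) and (36) is what yields the closed-form minimizers (22) and (35).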

A.1. Update of S (Proof of Proposition 1)

We prove Proposition 1 by first constructing the auxiliary function (Proposition 4 below) and then focusing on its minimum (Proposition 5 below). Due to separability, the update of S ∈ R_+^{F×P} relies on the update of each of its rows. Hence, we only derive the update for a vector s ∈ R_+^P.

Definition 2 (Objective function C_S(s)). For v ∈ R_+^N, L ∈ R_+^{P×K}, H ∈ R_+^{K×N}, m ∈ {0,1}^N, m^o ∈ {0,1}^P, s ∈ R_+^P, let us define

    C_S(s) ≜ Σ_n m_n [ Ĉ_n(s) + Č_n(s) ] + C̄        (15)

where

    Ĉ_n(s) ≜ d̂_β(v_n | [s^T LH]_n),   Č_n(s) ≜ ď_β(v_n | [s^T LH]_n)   and   C̄ ≜ d̄_β(m.v) + λ d_β(m^o.s^o | m^o.s)        (16)

Proposition 4 (Auxiliary function G_S(s|s̃) for C_S(s)). Let s̃ ∈ R_+^P be such that ∀n, ṽ_n > 0 and ∀p, s̃_p > 0, where ṽ^T ≜ s̃^T LH. Then the function G_S(s|s̃) defined by

    G_S(s|s̃) ≜ Σ_n m_n [ Ĝ_n(s|s̃) + Ǧ_n(s|s̃) ] + C̄        (17)

where

    Ĝ_n(s|s̃) ≜ Σ_p ([LH]_{pn} s̃_p / ṽ_n) d̂_β(v_n | ṽ_n s_p / s̃_p)   and   Ǧ_n(s|s̃) ≜ ď_β(v_n|ṽ_n) + ď'_β(v_n|ṽ_n) Σ_p [LH]_{pn} (s_p − s̃_p)        (18)

is an auxiliary function for C_S(s).

Proof. We trivially have G_S(s̃|s̃) = C_S(s̃). We use the separability in n and in p in order to upper bound each convex term Ĉ_n(s) and each concave term Č_n(s).

Convex term Ĉ_n(s). Let us prove that Ĝ_n(s|s̃) ≥ Ĉ_n(s). Let 𝒫 be the set of indices such that [LH]_{pn} ≠ 0 and define, for p ∈ 𝒫,

    λ_{pn} ≜ [LH]_{pn} s̃_p / ṽ_n = [LH]_{pn} s̃_p / Σ_{p'∈𝒫} [LH]_{p'n} s̃_{p'}.        (19)

We have Σ_{p∈𝒫} λ_{pn} = 1 and

    Ĝ_n(s|s̃) = Σ_{p∈𝒫} λ_{pn} d̂_β(v_n | [LH]_{pn} s_p / λ_{pn}) ≥ d̂_β(v_n | Σ_{p∈𝒫} λ_{pn} [LH]_{pn} s_p / λ_{pn}) = d̂_β(v_n | Σ_{p=1}^P [LH]_{pn} s_p) = Ĉ_n(s)        (20)

Concave term Č_n(s). We have Ǧ_n(s|s̃) ≥ Č_n(s) since Č_n(s) is concave and s ↦ Ǧ_n(s|s̃) is a tangent plane to Č_n(s) in s̃:

    Ǧ_n(s|s̃) = ď_β(v_n|ṽ_n) + ď'_β(v_n | Σ_{p'} [LH]_{p'n} s̃_{p'}) Σ_p [LH]_{pn} (s_p − s̃_p) = Č_n(s̃) + ⟨∇Č_n(s̃), s − s̃⟩        (21)

Proposition 5 (Minimum of G_S(s|s̃)). The minimum of s ↦ G_S(s|s̃) subject to the constraint m^o.s = m^o.s^o is reached at s^MM with

    ∀p,  s_p^MM ≜  s̃_p ( Σ_n m_n ṽ_n^{β−2} v_n [LH]_{pn} / Σ_n m_n ṽ_n^{β−1} [LH]_{pn} )^{γ(β)}   if m^o_p = 0
                   s^o_p                                                                           if m^o_p = 1.        (22)

Proof. Since variable s_p is fixed for p such that m^o_p = 1, we only consider the variables s_p for p such that m^o_p = 0. The related penalty term in G_S(s|s̃) vanishes when m^o_p = 0. Using (13) and (14), the minimum is obtained by cancelling the gradient

    ∇_{s_p} G_S(s|s̃) = Σ_n m_n [LH]_{pn} [ d̂'_β(v_n | ṽ_n s_p / s̃_p) + ď'_β(v_n|ṽ_n) ]        (23)

and by considering that the Hessian matrix is diagonal with nonnegative entries since d̂_β(x|y) is convex:

    ∇²_{s_p} G_S(s|s̃) = Σ_n m_n (ṽ_n [LH]_{pn} / s̃_p) d̂''_β(v_n | ṽ_n s_p / s̃_p) ≥ 0.        (24)

A.2. Update of L (Proof of Proposition 2)

We prove Proposition 2 by first constructing the auxiliary function (Proposition 6 below) and then focusing on its minimum (Proposition 7 below). As opposed to the update of S, no separability is considered here.

Definition 3 (Objective function C_L(L)). For V ∈ R_+^{F×N}, S ∈ R_+^{F×P}, H ∈ R_+^{K×N}, M ∈ {0,1}^{F×N}, L ∈ R_+^{P×K}, let us define

    C_L(L) ≜ Σ_{f,n} m_{fn} [ Ĉ_{fn}(L) + Č_{fn}(L) ] + C̄        (25)

where

    Ĉ_{fn}(L) ≜ d̂_β(v_{fn} | [SLH]_{fn}),   Č_{fn}(L) ≜ ď_β(v_{fn} | [SLH]_{fn})   and   C̄ ≜ d̄_β(M.V).        (26)

Proposition 6 (Auxiliary function G_L(L|L̃) for C_L(L)). Let L̃ ∈ R_+^{P×K} be such that ∀f, n, ṽ_{fn} > 0 and ∀p, k, l̃_{pk} > 0, where Ṽ ≜ SL̃H. Then the function G_L(L|L̃) defined by

    G_L(L|L̃) ≜ Σ_{f,n} m_{fn} [ Ĝ_{fn}(L|L̃) + Ǧ_{fn}(L|L̃) ] + C̄        (27)

where

    Ĝ_{fn}(L|L̃) ≜ Σ_{p,k} (s_{fp} l̃_{pk} h_{kn} / ṽ_{fn}) d̂_β(v_{fn} | ṽ_{fn} l_{pk} / l̃_{pk})        (28)

    Ǧ_{fn}(L|L̃) ≜ ď_β(v_{fn}|ṽ_{fn}) + ď'_β(v_{fn}|ṽ_{fn}) [ Σ_{p,k} s_{fp} h_{kn} (l_{pk} − l̃_{pk}) ]        (29)

is an auxiliary function for C_L(L).

Proof. We trivially have G_L(L̃|L̃) = C_L(L̃). In order to prove that G_L(L|L̃) ≥ C_L(L), we use the separability in f and n and we upper bound the convex terms Ĉ_{fn}(L) and the concave terms Č_{fn}(L).

Convex term Ĉ_{fn}(L). Let us prove that Ĝ_{fn}(L|L̃) ≥ Ĉ_{fn}(L). Let 𝒫 be the set of indices such that s_{fp} ≠ 0, 𝒦 be the set of indices such that h_{kn} ≠ 0 and define, for (p, k) ∈ 𝒫 × 𝒦,

    λ_{fpkn} ≜ s_{fp} l̃_{pk} h_{kn} / ṽ_{fn} = s_{fp} l̃_{pk} h_{kn} / Σ_{(p',k')∈𝒫×𝒦} s_{fp'} l̃_{p'k'} h_{k'n}.        (30)

We have Σ_{(p,k)∈𝒫×𝒦} λ_{fpkn} = 1 and

    Ĝ_{fn}(L|L̃) = Σ_{(p,k)∈𝒫×𝒦} λ_{fpkn} d̂_β(v_{fn} | s_{fp} l_{pk} h_{kn} / λ_{fpkn})        (31)

    ≥ d̂_β(v_{fn} | Σ_{(p,k)∈𝒫×𝒦} λ_{fpkn} s_{fp} l_{pk} h_{kn} / λ_{fpkn}) = d̂_β(v_{fn} | Σ_{p=1}^P Σ_{k=1}^K s_{fp} l_{pk} h_{kn}) = Ĉ_{fn}(L)        (32)

Concave term Č_{fn}(L). We have Ǧ_{fn}(L|L̃) ≥ Č_{fn}(L) since Č_{fn}(L) is concave and L ↦ Ǧ_{fn}(L|L̃) is a tangent plane to Č_{fn}(L) in L̃:

    Ǧ_{fn}(L|L̃) = ď_β(v_{fn}|ṽ_{fn}) + ď'_β(v_{fn} | Σ_{p',k'} s_{fp'} l̃_{p'k'} h_{k'n}) Σ_{p,k} s_{fp} h_{kn} (l_{pk} − l̃_{pk})        (33)

    = Č_{fn}(L̃) + ⟨∇Č_{fn}(L̃), L − L̃⟩        (34)

Proposition 7 (Minimum of G_L(L|L̃)). The minimum of L ↦ G_L(L|L̃) is reached at L^MM with

    ∀p, k,  l_{pk}^MM ≜  l̃_{pk} ( Σ_{f,n} s_{fp} m_{fn} ṽ_{fn}^{β−2} v_{fn} h_{kn} / Σ_{f,n} s_{fp} m_{fn} ṽ_{fn}^{β−1} h_{kn} )^{γ(β)}   if M ≠ 0
                         l̃_{pk}                                                                                                       otherwise.        (35)

Proof. Using (13) and (14), the minimum is obtained by cancelling the gradient

    ∇_{l_{pk}} G_L(L|L̃) = Σ_{f,n} m_{fn} s_{fp} h_{kn} [ d̂'_β(v_{fn} | ṽ_{fn} l_{pk} / l̃_{pk}) + ď'_β(v_{fn}|ṽ_{fn}) ]        (36)

and by considering that the Hessian matrix is diagonal with nonnegative entries since d̂_β(x|y) is convex:

    ∇²_{l_{pk}} G_L(L|L̃) = Σ_{f,n} m_{fn} (ṽ_{fn} s_{fp} h_{kn} / l̃_{pk}) d̂''_β(v_{fn} | ṽ_{fn} l_{pk} / l̃_{pk}) ≥ 0.        (37)