Non-negative matrix factorization

Mathieu Blondel

NTT Communication Science Laboratories

2014/10/28

1 / 44 Outline

• Non-negative matrix factorization (NMF)

• Optimization algorithms

• Passive-aggressive algorithms for NMF

2 / 44 Non-negative matrix factorization

3 / 44 Non-negative matrix factorization (NMF)

Given an observed matrix $R \in \mathbb{R}_+^{n \times d}$, find matrices $P \in \mathbb{R}_+^{n \times m}$ and $Q \in \mathbb{R}_+^{m \times d}$ such that $R \approx PQ$:

$$
\underbrace{\begin{bmatrix} r_{1,1} & \cdots & r_{1,d} \\ \vdots & \ddots & \vdots \\ r_{n,1} & \cdots & r_{n,d} \end{bmatrix}}_{n \times d}
\;\approx\;
\underbrace{\begin{bmatrix} p_{1,1} & \cdots & p_{1,m} \\ \vdots & \ddots & \vdots \\ p_{n,1} & \cdots & p_{n,m} \end{bmatrix}}_{n \times m}
\times
\underbrace{\begin{bmatrix} q_{1,1} & \cdots & q_{1,d} \\ \vdots & \ddots & \vdots \\ q_{m,1} & \cdots & q_{m,d} \end{bmatrix}}_{m \times d}
$$

$m$ is a user-given hyper-parameter

PQ is called a low-rank approximation of R
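
As a concrete illustration, here is a minimal numpy sketch of one way to compute such a factorization, using the classical multiplicative updates of Lee & Seung for the squared error. The function name `nmf`, the iteration count and the toy data are illustrative assumptions, not taken from these slides (optimization algorithms are discussed later in the talk).

```python
import numpy as np

def nmf(R, m, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative R (n x d) as R ≈ P Q with P (n x m) >= 0, Q (m x d) >= 0."""
    rng = np.random.default_rng(seed)
    n, d = R.shape
    P = rng.random((n, m))
    Q = rng.random((m, d))
    for _ in range(n_iter):
        # Multiplicative updates preserve the non-negativity of P and Q.
        Q *= (P.T @ R) / (P.T @ P @ Q + eps)
        P *= (R @ Q.T) / (P @ Q @ Q.T + eps)
    return P, Q

# Toy usage: rank-5 approximation of a random non-negative 100 x 40 matrix.
R = np.abs(np.random.default_rng(1).normal(size=(100, 40)))
P, Q = nmf(R, m=5)
print(np.linalg.norm(R - P @ Q) / np.linalg.norm(R))  # relative approximation error
```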

4 / 44 Examples of non-negative data

The matrix R could contain...

• Number of word occurrences in text documents

• Pixel intensities in images

• Ratings given by users to movies

• Magnitude spectrogram of an audio signal

• etc...

5 / 44 Why impose non-negativity on P and Q?

• Natural assumption if R is non-negative

• Each row of R is approximated by a strictly additive combination of factors / bases / atoms

$$
[r_{u,1}, \cdots, r_{u,d}] \;\approx\; \sum_{k=1}^{m} \underbrace{p_{u,k}}_{\text{weight / activation}} \times \underbrace{[q_{k,1}, \cdots, q_{k,d}]}_{\text{factor / basis / atom}}
$$

• P and Q tend to be sparse (have many zeros) ⇒ easy-to-interpret, parts-based solution (see the numeric sketch below)
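
The additive-combination view above can be checked numerically; the sketch below is purely illustrative (random P and Q standing in for an actual NMF fit).

```python
import numpy as np

# Illustrative P (n x m) and Q (m x d); in practice they would come from an NMF fit.
rng = np.random.default_rng(0)
P = rng.random((6, 3))   # weights / activations
Q = rng.random((3, 8))   # factors / bases / atoms

u = 2
# Row u of the approximation is a purely additive mix of the atoms (rows of Q).
row_u = sum(P[u, k] * Q[k] for k in range(Q.shape[0]))
assert np.allclose(row_u, (P @ Q)[u])
```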

6 / 44 Application 1: document analysis

• R is a collection of n text documents

• Each row $[r_{u,1}, \cdots, r_{u,d}]$ of R corresponds to a document represented as a bag of words

• $r_{u,i}$ is the number of occurrences of word i in document u

• Factors $[q_{k,1}, \ldots, q_{k,d}]$ in Q correspond to “topics”

• $p_{u,k}$ is the weight of topic k in document u (see the code sketch after this list)
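
A hedged sketch of this document-analysis setting using scikit-learn's NMF; the toy corpus and every parameter value below are assumptions made for illustration, not the setup used in the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

docs = [
    "the court upheld the constitution",
    "the president addressed the government",
    "flowers and leaves of the plant",
    "symptoms of the skin infection",
]

vectorizer = CountVectorizer(stop_words="english")
R = vectorizer.fit_transform(docs)            # n documents x d words (counts)

model = NMF(n_components=2, init="nndsvda", random_state=0)
P = model.fit_transform(R)                    # document-topic weights (n x m)
Q = model.components_                         # topic-word factors   (m x d)

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(Q):
    top = topic.argsort()[::-1][:4]
    print(f"topic {k}:", [words[i] for i in top])  # most heavily weighted words per topic
```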

7 / 44 [Embedded figure from Lee & Seung (1999), Nature: NMF discovers semantic features (“topics”) in encyclopedia articles; each feature is displayed as its eight highest-frequency words (e.g. court / government / council, flowers / leaves / plant), and the entry on the “Constitution of the United States” is reconstructed as an additive combination of a few features.]

Using n = 30,991 articles from the Grolier encyclopedia, vocabulary size d = 15,276 and number of topics m = 200

[Lee & Seung, 99]

8 / 44 Application 2: image processing

• R is a collection of n images or image patches

• Each row $[r_{u,1}, \cdots, r_{u,d}]$ of R corresponds to an image or image patch

• $r_{u,i}$ is the intensity of pixel i in image u

• Factors $[q_{k,1}, \ldots, q_{k,d}]$ in Q correspond to image “parts”

• $p_{u,k}$ is the weight of part k in image u ⇒ can be used as a high-level feature descriptor (see the code sketch after this list)
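
A hedged sketch of the image setting, using scikit-learn's bundled 8x8 digit images as stand-in non-negative image data; the component count and other values are illustrative assumptions, not figures from the slides.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF

digits = load_digits()
R = digits.data                      # n images x d pixels, non-negative intensities

model = NMF(n_components=16, init="nndsvda", max_iter=400, random_state=0)
P = model.fit_transform(R)           # per-image part activations (high-level features)
Q = model.components_                # 16 "parts"; each row reshapes to an 8 x 8 image

print(P.shape, Q.shape)              # (1797, 16) (16, 64)
```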

9 / 44 Application 2: image processing

[Figure 1 of Lee & Seung (1999), Nature 401: NMF learns a parts-based representation of faces, whereas vector quantization (VQ) and principal components analysis (PCA) learn holistic representations; a face is approximated by an additive combination of sparse, part-like basis images.]

Using n = 2,429 face images, d = 19 × 19 pixels and m = 49 basis images [Lee & Seung, 99]

10 / 44

Application 3: recommendation systems

• R is a partially observed rating matrix

Example of rating matrix R [Louppe, 2010]:

           Avatar   Escape From Alcatraz   K-Pax   Shawshank Redemption   Usual Suspects
  Alice      1               2               ?              5                    4
  Bob        3               ?               2              5                    3
  Clint      ?               ?               ?              ?                    2
  Dave       5               ?               4              4                    5
  Ethan      4               ?               1              1                    ?

11 / 44 Application 3: recommendation systems

• Users and movies are projected in a common m-dimensional latent space [Louppe, 2010]

[Figure 2.4 of Louppe (2010): users (Alice, Bob, Clint, Ed) and movies are placed in a two-dimensional latent space whose axes range from female-oriented to male-oriented and from drama to comedy]

12 / 44

Application 3: recommendation systems

• Inner product in this space can be used to predict missing values

r_{u,i} ≈ [p_{u,1}, ..., p_{u,m}] · [q_{1,i}, ..., q_{m,i}]ᵀ = p_u · q_i
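A minimal NumPy sketch of this prediction (the toy sizes, index values and variable names below are illustrative, not from the slides): a missing rating is estimated as the dot product between the user's latent vector p_u and the item's latent vector q_i.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, m = 5, 4, 2        # toy dimensions
P = rng.random((n_users, m))         # one latent vector per user (rows of P)
Q = rng.random((m, n_items))         # one latent vector per item (columns of Q)

u, i = 2, 3                          # predict the rating of user 2 on item 3
predicted_rating = P[u] @ Q[:, i]    # p_u . q_i
print(predicted_rating)
```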

13 / 44 Optimization algorithms

14 / 44 Formulating an optimization problem

How do we find P ∈ R_+^{n×m} and Q ∈ R_+^{m×d} such that R ≈ PQ ?

15 / 44 Formulating an optimization problem

Using Euclidean distance:

minimize_{P≥0, Q≥0} F(P, Q) = (1/2) ‖R − PQ‖² + (λ/2) (‖P‖² + ‖Q‖²)

where the first term is the error term and the second is the regularization term

Non-convex in P and Q jointly, but convex in P or Q separately ⇒ we can alternate between updating P and Q
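For concreteness, a small NumPy sketch of evaluating this objective (the function and variable names are mine); an alternating scheme then repeatedly decreases F with respect to P for fixed Q, and then with respect to Q for fixed P, using any of the solvers on the following slides.

```python
import numpy as np

def euclidean_objective(R, P, Q, lam=0.1):
    """F(P, Q) = 1/2 ||R - PQ||^2 + lam/2 (||P||^2 + ||Q||^2)."""
    residual = R - P @ Q
    error = 0.5 * np.sum(residual ** 2)
    reg = 0.5 * lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    return error + reg
```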

16 / 44 Formulating an optimization problem

Using generalized KL divergence, a.k.a. I-divergence:

minimize_{P≥0, Q≥0} F(P, Q) = D_I(R‖PQ) + (λ/2) (‖P‖² + ‖Q‖²)

where the first term is the error term and the second is the regularization term

where D_I(A‖B) = Σ_{u,i} [ A_{u,i} log(A_{u,i} / B_{u,i}) − A_{u,i} + B_{u,i} ]

When λ = 0, equivalent to the MLE solution assuming r_{u,i} ∼ Poisson((PQ)_{u,i}) [Févotte, 2009]
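A direct NumPy translation of D_I(A‖B) (a sketch; it uses the usual convention 0 log 0 = 0 for zero entries of A):

```python
import numpy as np

def i_divergence(A, B):
    """Generalized KL (I-)divergence between non-negative matrices A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    mask = A > 0
    log_term = np.zeros_like(A)
    log_term[mask] = A[mask] * np.log(A[mask] / B[mask])   # A log(A/B), only where A > 0
    return np.sum(log_term - A + B)
```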

17 / 44 Two kinds of sparsity

• Sparsity of non-zero entries

    R = [ 1 0 3 ]
        [ 0 2 0 ]
        [ 0 0 1 ]

• Sparsity of observed values

    R = [ 1 ? 3 ]
        [ ? 2 ? ]
        [ ? ? 1 ]

• These two settings require different algorithm design and implementation
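The two notions of sparsity suggest different data structures. A minimal sketch of one common choice (not the only option): a scipy.sparse matrix when the zeros are genuine observations, versus a list of (row, column, value) triplets when only some entries are observed.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparsity of non-zero entries: the zeros are real, observed values.
R_sparse = csr_matrix(np.array([[1, 0, 3],
                                [0, 2, 0],
                                [0, 0, 1]], dtype=float))

# Sparsity of observed values: only the entries in Omega are known;
# store them as (u, i, r) triplets and never materialize the "?" cells.
observed = [(0, 0, 1.0), (0, 2, 3.0),
            (1, 1, 2.0),
            (2, 2, 1.0)]
```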

18 / 44 Multiplicative method

• Euclidean distance, no regularization [Lee & Seung, 2001]

P_{u,k} ← P_{u,k} · (R Qᵀ)_{u,k} / (P Q Qᵀ)_{u,k}        Q_{k,i} ← Q_{k,i} · (Pᵀ R)_{k,i} / (Pᵀ P Q)_{k,i}

• Similar updates exist for the generalized KL divergence case

• Guarantees that the objective is non-increasing... [Lee & Seung, 2001]

• ...but not convergence [Lin, 2007]
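A compact NumPy version of these updates (a sketch; the small constant eps added to the denominators to avoid division by zero is a common implementation detail, not part of the update as stated above):

```python
import numpy as np

def nmf_multiplicative(R, m, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for min ||R - PQ||^2 with P, Q >= 0 [Lee & Seung, 2001]."""
    rng = np.random.default_rng(seed)
    n, d = R.shape
    P = rng.random((n, m))
    Q = rng.random((m, d))
    for _ in range(n_iter):
        P *= (R @ Q.T) / (P @ Q @ Q.T + eps)
        Q *= (P.T @ R) / (P.T @ P @ Q + eps)
    return P, Q
```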

19 / 44 Projected gradient method

• Gradient step followed by a truncation [Lin, 2007]

P ← max(P − η ∇_P F(P, Q), 0)
Q ← max(Q − η ∇_Q F(P, Q), 0)

• η can be fixed to a small constant or adjusted by line search

• Converges to a stationary point of F
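A sketch of one projected gradient step with a fixed step size η on the regularized Euclidean objective (Lin (2007) additionally uses a line search and stopping conditions that are not shown here):

```python
import numpy as np

def projected_gradient_step(R, P, Q, eta=1e-3, lam=0.1):
    """One step of P <- max(P - eta * grad_P F, 0) and Q <- max(Q - eta * grad_Q F, 0)."""
    residual = P @ Q - R                 # gradient of the error term uses PQ - R
    grad_P = residual @ Q.T + lam * P
    grad_Q = P.T @ residual + lam * Q
    P_new = np.maximum(P - eta * grad_P, 0.0)
    Q_new = np.maximum(Q - eta * grad_Q, 0.0)
    return P_new, Q_new
```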

20 / 44 Projected stochastic gradient method

• Objective with missing values:

minimize_{P≥0, Q≥0} F(P, Q) = 1/(2|Ω|) Σ_{(u,i)∈Ω} (r_{u,i} − p_u · q_i)² + (λ/2) (‖P‖² + ‖Q‖²)

where Ω is the set of observed values
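This restricted objective can be evaluated directly from the observed triplets (a sketch; Ω is passed as parallel arrays of row indices, column indices and values):

```python
import numpy as np

def masked_objective(rows, cols, vals, P, Q, lam=0.1):
    """F(P, Q) over the observed set Omega = {(rows[j], cols[j], vals[j])}."""
    preds = np.einsum('jk,kj->j', P[rows], Q[:, cols])   # p_u . q_i for each observation
    error = np.sum((np.asarray(vals) - preds) ** 2) / (2.0 * len(vals))
    reg = 0.5 * lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    return error + reg
```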

21 / 44 Projected stochastic gradient method

• Similar to the projected gradient method, but uses a stochastic approximation of the gradient (a per-rating update is sketched after this list)

p_u ← max(p_u − η ∇_{p_u} F(P, Q), 0)
q_i ← max(q_i − η ∇_{q_i} F(P, Q), 0)

• Slow convergence in terms of number of iterations...

• ...but very low iteration cost ⇒ very fast in practice

• However, quite sensitive to the choice of η ☹
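One observed rating (u, i, r) yields the following projected stochastic gradient step (a sketch of the per-rating update; the λ regularization terms and the 1/|Ω| scaling are omitted here for brevity):

```python
import numpy as np

def projected_sgd_step(P, Q, u, i, r, eta=0.01):
    """Update row P[u] and column Q[:, i] in place from a single rating r = R[u, i]."""
    err = P[u] @ Q[:, i] - r             # prediction error on this entry
    grad_p = err * Q[:, i]               # gradient w.r.t. p_u
    grad_q = err * P[u]                  # gradient w.r.t. q_i (uses the old p_u)
    P[u] = np.maximum(P[u] - eta * grad_p, 0.0)
    Q[:, i] = np.maximum(Q[:, i] - eta * grad_q, 0.0)
```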

22 / 44 Coordinate descent

• Update a single variable at a time [Hsieh & Dhillon, 2011; Yu et al., 2012]

P_{u,k} ← P_{u,k} + argmin_δ F(P + δ E_{u,k}, Q)     or     Q_{k,i} ← Q_{k,i} + argmin_δ F(P, Q + δ E_{k,i})

where Eu,k is a matrix with all elements zero except the (u, k) element which equals one

• Closed-form update in the Euclidean distance case (a sketch follows after this list)

• My personal favorite in the batch setting ☺
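For the unregularized Euclidean loss, the one-variable subproblem is a convex quadratic in δ, so the update is a Newton step clipped at zero. A minimal sketch for a single element P[u, k] (variable names are mine; it assumes Q[k] is not identically zero):

```python
import numpy as np

def cd_update_P(R, P, Q, u, k):
    """Closed-form coordinate update of P[u, k] for min 1/2 ||R - PQ||^2 with P >= 0."""
    residual_row = R[u] - P[u] @ Q       # residual of row u
    grad = -residual_row @ Q[k]          # dF / dP[u, k]
    hess = Q[k] @ Q[k]                   # d^2F / dP[u, k]^2
    P[u, k] = max(P[u, k] - grad / hess, 0.0)
```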

23 / 44 Passive-aggressive algorithms for NMF

24 / 44 Online algorithms

• In real-world applications, missing entries in R may be observed in real time ◦ A user gave a rating to a movie

◦ A user clicked on a link

• Ideally, P and Q should be updated in real time to reflect the knowledge that we gained from the new entry

25 / 44 Online algorithms

1. Initialize P and Q randomly

                       [ q1,1 q1,2 q1,3 ]
                       [ q2,1 q2,2 q2,3 ]
                       [ q3,1 q3,2 q3,3 ]

[ p1,1 p1,2 p1,3 ]     [ ?    ?    ?    ]
[ p2,1 p2,2 p2,3 ]     [ ?    ?    ?    ]
[ p3,1 p3,2 p3,3 ]     [ ?    ?    ?    ]

26 / 44 Online algorithms

2. An element of R is revealed (e.g., a user rated a movie)

                       [ q1,1 q1,2 q1,3 ]
                       [ q2,1 q2,2 q2,3 ]
                       [ q3,1 q3,2 q3,3 ]

[ p1,1 p1,2 p1,3 ]     [ ?    ?    ?    ]
[ p2,1 p2,2 p2,3 ]     [ ?    ?    3    ]
[ p3,1 p3,2 p3,3 ]     [ ?    ?    ?    ]

27 / 44 Online algorithms

3. Update corresponding row of P and column of Q

                       [ q1,1 q1,2 q1,3 ]
                       [ q2,1 q2,2 q2,3 ]
                       [ q3,1 q3,2 q3,3 ]

[ p1,1 p1,2 p1,3 ]     [ ?    ?    ?    ]
[ p2,1 p2,2 p2,3 ]     [ ?    ?    3    ]
[ p3,1 p3,2 p3,3 ]     [ ?    ?    ?    ]

(here row 2 of P and column 3 of Q, since r_{2,3} was revealed)

28 / 44 Large-scale learning using online algorithms

• Online algorithms can also be used in a large-scale batch setting

• Online to batch conversion: make several passes over the dataset • Advantages of online algorithms ◦ Low iteration cost

◦ Low memory footprint

◦ Ease of implementation 29 / 44 Passive-aggressive algorithms for NMF

• Passive-aggressive algorithms [Crammer et al., 2006] are online algorithms for classification and regression

• Very popular in the Natural Language Processing (NLP) community

• We propose passive-aggressive algorithms for NMF

30 / 44 Passive-aggressive algorithms for NMF

• On iteration t, r_{u_t,i_t} is revealed

• We propose to update p_{u_t} and q_{i_t} by

p_{u_t}^{t+1} = argmin_{p ∈ R_+^m} (1/2) ‖p − p_{u_t}^t‖²   s.t.   |p · q_{i_t}^t − r_{u_t,i_t}| = 0

q_{i_t}^{t+1} = argmin_{q ∈ R_+^m} (1/2) ‖q − q_{i_t}^t‖²   s.t.   |p_{u_t}^t · q − r_{u_t,i_t}| = 0

• Conservative (do not change the model too much) and corrective (satisfy the constraint) update

31 / 44 Passive-aggressive algorithms for NMF

Since the two problems have the same form, we can simplify the notation:

w = p or q (variable)
w^{t+1} = p_{u_t}^{t+1} or q_{i_t}^{t+1} (solution)
w^t = p_{u_t}^t or q_{i_t}^t (current iterate)
x_t = q_{i_t}^t or p_{u_t}^t (input)

y_t = r_{u_t,i_t} (target)

32 / 44 Passive-aggressive algorithms for NMF

• Allow the target not to be fit perfectly:

w^{t+1} = argmin_{w ∈ R_+^m} (1/2) ‖w − w^t‖²   s.t.   |w · x_t − y_t| ≤ ε

• If |w^t · x_t − y_t| ≤ ε, the algorithm is “passive”, i.e., w^{t+1} = w^t
• Otherwise, it is “aggressive”: the model is updated

33 / 44 Passive-aggressive algorithms for NMF

• The previous update changes the model as much as needed to satisfy the constraint ⇒ potential overfitting

• Introduce a slack variable to allow some error

w^{t+1}, ξ* = argmin_{w ∈ R_+^m, ξ ∈ R_+} (1/2) ‖w − w^t‖² + C ξ

s.t. |w · x_t − y_t| ≤ ε + ξ

• C > 0 controls the trade-off between being conservative and corrective

34 / 44 Passive-aggressive algorithms for NMF

• The solution is of the form w^{t+1} = max(w^t + (κ − θ) x_t, 0)

where κ and θ are non-negative scalars • In our AISTATS paper, we present three O(m) methods for finding κ and θ [Blondel et al., 2014] ◦ An exact method

◦ A bisection method

◦ An approximate update method
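The exact and bisection updates are described in the paper; purely as an illustration of the passive/aggressive structure, the sketch below applies a standard PA-I ε-insensitive regression step [Crammer et al., 2006] followed by clipping to the non-negative orthant. It has the same form max(w + (κ − θ) x, 0), but it is an approximation, not the exact update of Blondel et al. (2014).

```python
import numpy as np

def pa_nonneg_step(w, x, y, eps=0.1, C=0.1):
    """PA-I-style epsilon-insensitive step, then projection onto w >= 0 (approximate)."""
    pred = w @ x
    loss = max(0.0, abs(pred - y) - eps)
    if loss == 0.0:                      # passive: prediction already within eps of y
        return w
    tau = min(C, loss / (x @ x))         # aggressive: PA-I step size
    return np.maximum(w + np.sign(y - pred) * tau * x, 0.0)
```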

35 / 44 Passive-aggressive algorithms for NMF

Difference between the effect of ε and C
• Increasing ε increases the number of passive updates ⇒ trades some error for faster training

• Reducing C reduces update aggressiveness, since 0 ≤ κ ≤ C and 0 ≤ θ ≤ C ⇒ reduces overfitting

36 / 44 NMF algorithm comparison

Solver                        Iteration cost   Online   Hyper-parameter
Multiplicative                      ☹             ☹            ☺
Projected grad.                     ☹             ☹            ☺
Projected stochastic grad.          ☺             ☺            ☹
Coordinate descent                  ☺             ☹            ☺
Passive-Aggressive                  ☺             ☺            ☺

(☺ = favorable, ☹ = unfavorable.) In the setting with missing values.

37 / 44 Experimental results

• Datasets used

Dataset        Users       Items     Ratings
Movielens10M   69,878      10,677    10,000,054
Netflix        480,189     17,770    100,480,507
Yahoo-Music    1,000,990   624,961   252,800,275

• We split ratings into 4/5 for training and 1/5 for testing

• The task is to predict ratings in the test set

38 / 44 Convergence results

[Figure: absolute error on test data vs. training time (seconds), for C = 10⁻² (left) and C = 10⁻¹ (right); curves compare the exact method, bisection with τ = 10⁻³, 10⁻², 10⁻¹, and the approximate update]

Results w.r.t. test data on the Movielens10M dataset

39 / 44 Comparison with other solvers

Dataset        Passes           NN-PA-I          SGD              CD
Movielens10M   1       Error    23.75 ± 0.05     31.58 ± 1.91     34.59 ± 0.03
                       Time      3.24 ± 0.01      2.68 ± 0.01      3.88 ± 0.01
               3       Error    20.91 ± 0.04     25.27 ± 0.02     21.38 ± 0.05
                       Time     10.28 ± 0.01      8.09 ± 0.08     12.73 ± 0.01
               5       Error    20.61 ± 0.01     24.54 ± 0.02     20.47 ± 0.01
                       Time     17.40 ± 0.06     13.44 ± 0.03     22.57 ± 0.01
Netflix        1       Error    22.32 ± 0.01     27.29 ± 0.81     34.31 ± 0.01
                       Time     34.29 ± 0.10     27.68 ± 0.41     36.58 ± 0.37
               3       Error    20.01 ± 0.01     24.28 ± 0.01     21.60 ± 0.01
                       Time    109.53 ± 2.97     82.98 ± 0.14    153.46 ± 0.72
               5       Error    19.64 ± 0.01     23.70 ± 0.14     19.37 ± 0.01
                       Time    181.43 ± 0.22    133.59 ± 0.60    270.28 ± 0.49
Yahoo-Music    1       Error    50.64 ± 0.33     52.52 ± 0.68     57.08 ± 0.28
                       Time    114.16 ± 0.05     96.89 ± 0.04    170.38 ± 0.06
               3       Error    38.44 ± 0.16     44.63 ± 1.24     45.32 ± 0.23
                       Time    335.13 ± 0.34    291.59 ± 0.24    468.86 ± 0.69
               5       Error    36.26 ± 0.09     41.62 ± 1.15     37.97 ± 0.21
                       Time    576.08 ± 0.73    475.86 ± 2.90    787.57 ± 1.68

40 / 44 Learned topic model

Topic 1: Scream (Comedy, Horror, Thriller); The Fugitive (Thriller); The Blair Witch Project (Horror, Thriller); Deep Cover (Action, Crime, Thriller); The Plague of the Zombies (Horror)

Topic 2: Dumb & Dumber (Comedy); Ace Ventura: Pet Detective (Comedy); Five Corners (Drama); Ace Ventura: When Nature Calls (Comedy); Jump Tomorrow (Comedy, Drama, Romance)

Topic 3: Pocahontas (Animation, Children, Musical, ...); [title illegible] (Adventure, Animation, Children, ...); Merry Christmas Mr. Lawrence (Drama, War); Toy Story (Adventure, Animation, Children, ...); The Sword in the Stone (Animation, Children, Fantasy, ...)

Topic 4: Belle de jour (Drama); Jack the Bear (Comedy, Drama); The Cabinet of Dr. Caligari (Crime, Drama, Fantasy, ...); M*A*S*H (Comedy, Drama, War); Bed of Roses (Drama, Romance)

Topic 5: Four Weddings and a Funeral (Comedy, Romance); The Birdcage (Comedy); Shakespeare in Love (Comedy, Drama, Romance); Henri V (Drama, War); Three Men and a Baby (Comedy)

Topic 6: Terminator 2: Judgment Day (Action, Sci-Fi); Braveheart (Action, Drama, War); Aliens (Action, Horror, Sci-Fi); Mortal Kombat (Action, Adventure, Fantasy); Congo (Action, Adventure, Mystery, ...)

6 out of 20 topics extracted from the Movielens10M dataset

Topic 4 Topic 5 Topic 6 Belle de jour Four Weddings and a Funeral Terminator 2: Judgment Day (Drama) (Comedy, Romance) (Action, Sci-Fi) Jack the Bear The Birdcage Braveheart (Comedy, Drama) (Comedy) (Action, Drama, War) The Cabinet of Dr. Caligari Shakespeare in Love Aliens (Crime, Drama, Fantasy, ...) (Comedy, Drama, Romance) (Action, Horror, Sci-Fi) M*A*S*H Henri V Mortal Kombat (Comedy, Drama, War) (Drama, War) (Action, Adventure, Fantasy) Bed of Roses Three Men and a Baby Congo (Drama, Romance) (Comedy) (Action, Adventure, Mystery, ...) 6 out of 20 topics extracted from the Movielens10M

41 / 44 dataset Conclusion

• NMF is a widely-used method in machine learning and signal processing

• Its main applications are high-level feature extraction, denoising and matrix completion

• We proposed online passive-aggressive algorithms for the setting with missing values

42 / 44 References

Mathieu Blondel, Yotaro Kubo, and Naonori Ueda. Online passive-aggressive algorithms for non-negative matrix factorization and completion. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 96–104, 2014.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

Cho-Jui Hsieh and Inderjit S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1064–1072, 2011.

Cédric Févotte and A. Taylan Cemgil. Nonnegative matrix factorizations as probabilistic inference in composite models. In Proceedings of EUSIPCO, pages 1913–1917, 2009. 43 / 44 References

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, pages 556–562, 2001.

Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.

Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In ICDM, pages 765–774, 2012.

44 / 44