
arXiv:1406.1385v1 [cs.LG] 5 Jun 2014

Learning the Information Divergence

Onur Dikmen, Zhirong Yang, and Erkki Oja

The authors are with the Department of Information and Computer Science, Aalto University, 00076, Finland. E-mail: onur.dikmen@aalto.fi; zhirong.yang@aalto.fi; erkki.oja@aalto.fi.

Abstract—Information divergence that measures the difference between two nonnegative matrices or tensors has found its use in a variety of machine learning problems. Examples are Nonnegative Matrix/Tensor Factorization, Stochastic Neighbor Embedding, topic models, and Bayesian network optimization. The success of such learning algorithms depends heavily on a suitable divergence. A large variety of divergences have been suggested and analyzed, but very few results are available for an objective choice of the optimal divergence for a given task. Here we present a framework that facilitates automatic selection of the best divergence among a given family, based on maximum likelihood estimation. We first propose an approximated distribution for the Tweedie family, which enables automatic selection of the best β in the β-divergence family. Selecting the best β then becomes a machine learning problem solved by maximum likelihood. Next, we reformulate the α-divergence in terms of the β-divergence, which enables automatic selection of α by maximum likelihood with reuse of the learning principle for the β-divergence. Furthermore, we show the connections between the γ- and β-divergences as well as the Rényi- and α-divergences, such that our automatic selection framework is extended to non-separable divergences. Experiments on both synthetic and real-world data demonstrate that our method can quite accurately select the information divergence across different learning problems and various divergence families.

Index Terms—information divergence, Tweedie distribution, maximum likelihood, nonnegative matrix factorization, stochastic neighbor embedding.

I. INTRODUCTION

Information divergences are an essential element in modern machine learning. They originated in estimation theory, where a divergence maps the dissimilarity between two probability distributions to nonnegative values. Presently, information divergences have been extended to nonnegative tensors and are used in many learning problems where the objective is to minimize the approximation error between the observed data and the model. Typical applications include Nonnegative Matrix Factorization (see e.g. [1], [2], [3], [4]), Stochastic Neighbor Embedding [5], [6], topic models [7], [8], and Bayesian network optimization [9].

There exist a large variety of information divergences. In Section II, we summarize the most popularly used parametric divergence families, including α-divergences [10], [11], β-divergences [12], [13], [14], γ- and Rényi-divergences, and their combinations (e.g. [15]). The four families in turn belong to broader parametric families such as the Csiszár-Morimoto f-divergences [16], [17] and Bregman divergences [18]. Data analysis techniques based on information divergences have been widely and successfully applied to various types of data, such as text [19], facial images [20], electroencephalography, and audio spectrograms [21].

Compared to the rich set of available information divergences, there is little research on how to select the best one for a given application. This is an important issue because the performance of a divergence-based method in a particular estimation or modeling task depends much on the divergence used. Formulating a learning task in a family of divergences greatly increases the flexibility to handle different types of noise in the data. For example, the Euclidean distance is suitable for data with Gaussian noise; the Kullback-Leibler divergence has shown success in finding topics in text documents [7]; and the Itakura-Saito divergence is proven to be suitable for audio signal processing [21]. A conventional workaround is to select among a finite number of candidate divergences using a validation set. This approach, however, cannot be applied to divergences that are not separable over tensor entries. The validation approach is also problematic for learning tasks where all data are needed for learning, for example, cluster analysis.

Our starting-point for selecting the best divergence among the four popular parametric families in any given data modeling task is the Tweedie distribution [22], which is known to have a relationship with the β-divergence [23], [24]. The Maximum Tweedie Likelihood (MTL) is in principle a disciplined and straightforward method for choosing the optimal β value. However, in order for this to be feasible in practice, two shortcomings of the MTL method have to be overcome: 1) the Tweedie distribution is not defined for all β ∈ R; 2) calculation of the Tweedie likelihood is complicated and prone to numerical problems for large β.

In Section III, we propose a new method of statistical learning to overcome these drawbacks. We propose here a novel distribution using an exponential over the β-divergence with a specific augmentation term. The new distribution has the following nice properties: 1) it is close to the Tweedie distribution, especially at four important special cases; 2) it exists for all β ∈ R; 3) its likelihood can be calculated by standard statistical software. We call the new density Exponential Divergence with Augmentation (EDA). EDA is a non-normalized density, i.e., its density function includes a normalizing constant which is not analytically available. But, since the density is univariate, the normalizing constant can be estimated efficiently and accurately by numerical integration. The new method of Maximizing the EDA Likelihood (MEDAL) thus gives a more robust β selection in a wider range than MTL. Besides maximum likelihood, estimation of non-normalized densities can also be carried out using, e.g., Score Matching (SM) [25], specifically for the proposed EDA density. In the experiments section, we show that SM on EDA also performs as accurately as MEDAL.

Besides the β-divergence, the MEDAL method is extended to select the best divergence in other parametric families. We reformulate the α-divergence in terms of the β-divergence after a change of parameters, so that α can be optimized using the

MEDAL method. Our method can also be applied to non-separable cases. We show the equivalence between β- and γ-divergences, and between α- and Rényi divergences, by a connecting scalar, which allows us to choose the best γ- or Rényi-divergence by reusing the MEDAL method.

We tested our method with extensive experiments, whose results are presented in Section IV. We have used both synthetic data with a known distribution and real-world data including music, stock prices, and social networks. The MEDAL method is applied to different learning problems: Nonnegative Matrix Factorization (NMF) [26], [3], [1], Projective NMF [27], [28], and Symmetric Stochastic Neighbor Embedding for visualization [5], [6]. We also demonstrate that our method outperforms Score Matching on the Exponential Divergence distribution (ED), a previous approach for β-divergence selection [29]. Conclusions and discussions on future work are given in Section V.

II. INFORMATION DIVERGENCES

Many learning objectives can be formulated as an approximation of the form x ≈ µ, where x > 0 is the observed data (input) and µ is the approximation given by the model. The formulation for µ totally depends on the task to be solved. Consider Nonnegative Matrix Factorization: there x > 0 is a data matrix and µ is a product of two lower-rank nonnegative matrices which typically give a sparse representation for the columns of x. Other concrete examples are given in Section IV.

The approximation error can be measured by various information divergences. Suppose µ is parameterized by Θ. The learning problem becomes an optimization procedure that minimizes the given divergence D(x || µ(Θ)) over Θ. Regularization may be applied to Θ for complexity control. For notational brevity we focus on definitions over vectorial x, µ, Θ in this section, while they can be extended to matrices or higher-order tensors in a straightforward manner.

In this work we consider four parametric families of divergences, which are the widely used α-, β-, γ- and Rényi-divergences. This collection is rich because it covers most commonly used divergences. The definitions of the four families and some of their special cases are given below.

• α-divergence [10], [11] is defined as

    D_α(x || µ) = Σ_i [ x_i^α µ_i^(1−α) − α x_i + (α − 1) µ_i ] / ( α(α − 1) ).   (1)

The family contains the following special cases:

    D_{α=2}(x || µ) = D_P(x || µ) = (1/2) Σ_i (x_i − µ_i)² / µ_i
    D_{α→1}(x || µ) = D_I(x || µ) = Σ_i [ x_i ln(x_i / µ_i) − x_i + µ_i ]
    D_{α=1/2}(x || µ) = 2 D_H(x || µ) = 2 Σ_i ( √x_i − √µ_i )²
    D_{α→0}(x || µ) = D_I(µ || x) = Σ_i [ µ_i ln(µ_i / x_i) − µ_i + x_i ]
    D_{α=−1}(x || µ) = D_IP(x || µ) = (1/2) Σ_i (x_i − µ_i)² / x_i,

where D_I, D_P, D_IP, and D_H denote the non-normalized Kullback-Leibler, Pearson Chi-square, inverse Pearson, and Hellinger divergences, respectively.

• β-divergence [30], [31] is defined as

    D_β(x || µ) = Σ_i [ x_i^(β+1) + β µ_i^(β+1) − (β + 1) x_i µ_i^β ] / ( β(β + 1) ).   (2)

The family contains the following special cases:

    D_{β=1}(x || µ) = D_EU(x || µ) = (1/2) Σ_i (x_i − µ_i)²   (3)
    D_{β→0}(x || µ) = D_I(x || µ) = Σ_i [ x_i ln(x_i / µ_i) − x_i + µ_i ]   (4)
    D_{β→−1}(x || µ) = D_IS(x || µ) = Σ_i [ x_i / µ_i − ln(x_i / µ_i) − 1 ]   (5)
    D_{β=−2}(x || µ) = Σ_i [ x_i / (2µ_i²) − 1/µ_i + 1/(2x_i) ],   (6)

where D_EU and D_IS denote the Euclidean distance and the Itakura-Saito divergence, respectively.

• γ-divergence [13] is defined as

    D_γ(x || µ) = [ ln( Σ_i x_i^(γ+1) ) + γ ln( Σ_i µ_i^(γ+1) ) − (γ + 1) ln( Σ_i x_i µ_i^γ ) ] / ( γ(γ + 1) ).   (7)

The normalized Kullback-Leibler (KL) divergence is a special case of the γ-divergence:

    D_{γ→0}(x || µ) = D_KL(x̃ || µ̃) = Σ_i x̃_i ln( x̃_i / µ̃_i ),   (8)

where x̃_i = x_i / Σ_j x_j and µ̃_i = µ_i / Σ_j µ_j.

• Rényi divergence [32] is defined as

    D_ρ(x || µ) = ln( Σ_i x̃_i^ρ µ̃_i^(1−ρ) ) / ( ρ − 1 )   (9)

for ρ > 0. The Rényi divergence also includes the normalized Kullback-Leibler divergence as its special case when ρ → 1.

III. DIVERGENCE SELECTION BY STATISTICAL LEARNING

The above rich collection of information divergences basically allows great flexibility in the approximation framework. However, practitioners must face a choice problem: how to select the best divergence in a family? In most existing applications the selection is done empirically by the human. A conventional automatic selection method is cross-validation [33], [34], where the training only uses part of the entries of x and the remaining ones are used for validation. This method has a number of drawbacks: 1) it is only applicable to divergences where the entries are separable (e.g., the α- or β-divergence); leaving out some entries for the γ- and Rényi divergences is infeasible due to the logarithm or normalization; 2) separation of entries is not applicable in applications where all entries are needed in the learning, for example, cluster analysis.

Our proposal here is to use the familiar and proven technique of maximum likelihood estimation for automatic divergence selection, using a suitably chosen and very flexible probability density model for the data. In the following we discuss this statistical learning approach for automatic divergence selection in the family of β-divergences, followed by its extensions to the other divergence families.

A. Selecting β-divergence

1) Maximum Tweedie Likelihood (MTL): We start from the probability density function (pdf) of an exponential dispersion model (EDM) [22]:

    p_EDM(x; θ, φ, p) = f(x, φ, p) exp{ (1/φ)( xθ − κ(θ) ) }   (10)

where φ > 0 is the dispersion parameter, θ is the canonical parameter, and κ(θ) is the cumulant function (when φ = 1 its derivatives w.r.t. θ give the cumulants). Such a distribution has mean µ = κ′(θ) and variance V(µ, p) = φ κ″(θ). The density is defined for x ≥ 0, thus µ > 0.

A Tweedie distribution is an EDM whose variance has a special form, V(µ) = µ^p with p ∈ R \ (0, 1). The canonical parameter and the cumulant function that satisfy this property are [22]

    θ = µ^(1−p) / (1 − p) if p ≠ 1, and θ = ln µ if p = 1;
    κ(θ) = µ^(2−p) / (2 − p) if p ≠ 2, and κ(θ) = ln µ if p = 2.   (11)

Note that ln µ is the limit of ( µ^t − 1 ) / t as t → 0. Finite analytical forms of f(x, φ, p) in the Tweedie distribution are generally unavailable. The function can be expanded with infinite series [35] or approximated by the saddle point method [36].

It is known that the Tweedie distribution has a connection to the β-divergence (see, e.g., [23], [24]): maximizing the likelihood of the Tweedie distribution for certain p values is equivalent to minimizing the corresponding divergence with β = 1 − p. Especially, the gradients of the log-likelihoods of the Gamma, Poisson and Gaussian distributions over µ_i are equal to those of the β-divergence with β = −1, 0, 1, respectively. This motivates a β-divergence selection method by Maximum Tweedie Likelihood (MTL).

However, MTL has the following two shortcomings. First, the Tweedie distribution is not defined for p ∈ (0, 1). That is, if the best β = 1 − p happens to be in the range (0, 1), it cannot be found by MTL; in addition, there is little research on the Tweedie distribution with β > 1 (p < 0). Second, f(x, φ, p) in the Tweedie distribution is not the probability normalizing constant (note that it depends on x), and its evaluation requires ad hoc techniques. The existing software using the infinite series expansion approach [35] (see Appendix A) is prone to numerical computation problems, especially for −0.1 < β < 0. There is no existing implementation that can calculate the Tweedie likelihood for β > 1.

2) Maximum Exponential Divergence with Augmentation Likelihood (MEDAL): Our answer to the above shortcomings in MTL is to design an alternative distribution with the following properties: 1) it is close to the Tweedie distribution, especially for the four crucial points when β ∈ {−2, −1, 0, 1}; 2) it should be defined for all β ∈ R; 3) its pdf can be evaluated more robustly by standard statistical software.

From (10) and (11) the pdf of the Tweedie distribution is written as

    p_Tw(x; µ, φ, β) = f(x, φ, β) exp{ (1/φ)( x µ^β / β − µ^(β+1) / (β + 1) ) }   (12)

w.r.t. β instead of p, using the relation β = 1 − p. This holds when β ≠ 0 and β ≠ −1. The extra terms 1/(1 − p) and −1/(2 − p) in (11) have been absorbed into f(x, φ, β). The cases β = 0 and β = −1 have to be analyzed separately.

To make an explicit connection with the β-divergence defined in (2), we suggest a new distribution given in the following form:

    p_approx(x; µ, φ, β) = g(x, φ, β) exp{ −(1/φ) D_β(x || µ) }
                         = g(x, φ, β) exp{ (1/φ)( −x^(β+1) / (β(β + 1)) + x µ^β / β − µ^(β+1) / (β + 1) ) }.   (13)

Now the β-divergence for scalar x appears in the exponent, and g(x, φ, β) will be used to approximate this with the Tweedie distribution. Ideally, the choice

    g(x, φ, β) = f(x, φ, β) / exp{ −(1/φ) x^(β+1) / (β(β + 1)) }

would result in full equivalence with the Tweedie distribution, as seen from (12). However, because f(x, φ, β) is unknown in the general case, such a g is also unavailable.

We can, however, try to approximate g using the fact that p_approx must be a proper density whose integral is equal to one. From (13) it then follows that

    exp{ (1/φ) µ^(β+1) / (β + 1) } = ∫ g(x, φ, β) exp{ (1/φ)( −x^(β+1) / (β(β + 1)) + x µ^β / β ) } dx.   (14)

This integral is, of course, impossible to evaluate because we do not even know the function inside. However, the integral can be approximated nicely by Laplace's method. Laplace's approximation is

    ∫_a^b f(x) e^{M h(x)} dx ≈ f(x₀) e^{M h(x₀)} √( 2π / ( M |h″(x₀)| ) ),

where x₀ = argmax_x h(x) and M is a large constant.

In order to approximate (14) by Laplace's method, 1/φ takes the role of M and thus the approximation is valid for small φ. We need the maximizer of the exponentiated term h(x) = −x^(β+1) / (β(β + 1)) + x µ^β / β. This term has a zero first derivative and a negative second derivative at x = µ, i.e., it is maximized there. Thus, Laplace's method gives us

    exp{ (1/φ) µ^(β+1) / (β + 1) } ≈ g(µ, φ, β) √( 2πφ / µ^(β−1) ) exp{ (1/φ)( −µ^(β+1) / (β(β + 1)) + µ^(β+1) / β ) }
                                   = g(µ, φ, β) √( 2πφ / µ^(β−1) ) exp{ (1/φ) µ^(β+1) / (β + 1) }.

The approximation gives g(µ, φ, β) = µ^((β−1)/2) / √(2πφ), which suggests the function

    g(x, φ, β) = x^((β−1)/2) / √(2πφ) = exp{ ((β − 1)/2) ln x } / √(2πφ).

Putting this result into (13) as such does not guarantee a proper pdf, however, because it is an approximation, only valid at the limit φ → 0. To make it proper, we have to add a normalizing constant to the density in (13). The pdf of the final distribution, for a scalar argument x, thus becomes

    p_approx(x; µ, β, φ) = (1 / Z(µ, β, φ)) exp{ R(x, β) − (1/φ) D_β(x || µ) }   (15)

where Z(µ, β, φ) is the normalizing constant accounting for the terms which are independent of x, and R(x, β) is an augmentation term given as

    R(x, β) = ((β − 1)/2) ln x.   (16)

This pdf is a proper density for all β ∈ R, which is guaranteed by the following theorem.

Theorem 1: Let f(x) = exp{ ((β − 1)/2) ln x − (1/φ) D_β(x || µ) }. The improper integral ∫₀^∞ f(x) dx converges.

Proof: Let q = −(β − 1)/2 + 1 + ǫ with any ǫ ∈ (0, ∞), and let g(x) = x^(−q). By these definitions we have q > −(β − 1)/2, and then for x ≥ 1, 0 ≤ ((β − 1)/2 + q) φ ln x ≤ D_β(x || µ), i.e., 0 ≤ f(x) ≤ g(x). By the Cauchy convergence test, we know that ∫₁^∞ g(x) dx is convergent because q > 1, and so is ∫₁^∞ f(x) dx. Obviously f(x) is continuous and bounded for x ∈ [0, 1]. Therefore ∫₀^∞ f(x) dx = ∫₀^1 f(x) dx + ∫₁^∞ f(x) dx also converges.

Finally, for vectorial x, the pdf is a product of the marginal densities:

    p_EDA(x; µ, β, φ) = (1 / Z(µ, β, φ)) exp{ R(x, β) − (1/φ) D_β(x || µ) }   (17)

where D_β(x || µ) is defined in (2) and

    R(x, β) = ((β − 1)/2) Σ_i ln x_i.   (18)

We call (17) the Exponential Divergence with Augmentation (EDA) distribution, because it applies an exponential over an information divergence plus an augmentation term.

The log-likelihood of the EDA density can be written as

    ln p(x; µ, β, φ) = Σ_i ln p(x_i; µ_i, β, φ)
                     = Σ_i [ ((β − 1)/2) ln x_i − (1/φ) D_β(x_i || µ_i) − ln Z(µ_i, β, φ) ]   (19)

due to the fact that D_β(x || µ) in Eq. (2) and the augmentation term in (18) are separable over the x_i (i.e., the x_i are independent given the µ_i). The best β is now selected by

    β* = argmax_β { max_φ ln p(x; µ, β, φ) },   (20)

where µ = argmin_η D_β(x || η). We call the new divergence selection method Maximum EDA Likelihood (MEDAL).

Let us look at the four special cases of the Tweedie distribution: Gaussian (N), Poisson (PO), Gamma (G) and Inverse Gaussian (IN). They correspond to β = 1, 0, −1, −2. For simplicity of notation, we may drop the subscript i and write x and µ for one entry of x and µ. Then the log-likelihoods of the above four special cases are

    ln p_N(x; µ, φ) = −(1/2) ln(2πφ) − (1/(2φ)) (x − µ)²,
    ln p_PO(x; µ) = x ln µ − µ − ln Γ(x + 1)
                  ≈ x ln µ − µ − ln(2πx)/2 − x ln x + x,
    ln p_G(x; 1/φ, φµ) = (1/φ − 1) ln x − x/(φµ) − (1/φ) ln(φµ) − ln Γ(1/φ),
    ln p_IN(x; µ, 1/φ) = −(1/2) ln(2πφx³) − (1/φ)( x/(2µ²) − 1/µ + 1/(2x) ),

where in the Poisson case we employ Stirling's approximation. (The case β = 0 and φ ≠ 1 does not correspond to the Poisson distribution, but the transformation p_EDM(x; µ, φ, 1) = p_PO(x/φ; µ/φ)/φ can be used to evaluate the pdf.) To see the similarity of these four special cases with the general expression for the EDA log-likelihood in Eq. (19), let us look at one term in the sum there. It is a fairly straightforward exercise to plug in the β-divergences from Eqs. (3,4,5,6) and the augmentation term from Eq. (18) and see that the log-likelihoods coincide. The normalizing term ln Z(µ, β, φ) for these special cases can be determined from the corresponding density.

In general, the normalizing constant Z(µ, β, φ) is intractable except for a few special cases. Numerical evaluation of Z(µ, β, φ) can be implemented by standard statistical software. Here we employ the approximation with Gauss-Laguerre quadratures (details in Appendix B).

Finally, let us note that in addition to the maximum likelihood estimator, Score Matching (SM) [25], [37] can be applied to the estimation of β as a density parameter (see Section IV-A). In a previous effort, Lu et al. [29] proposed a similar exponential divergence (ED) distribution

    p_ED(x; µ, β) ∝ exp[ −D_β(x || µ) ],   (21)

but without the augmentation. It is easy to show that ED also exists for all β by changing q = 1 + ǫ in the proof of Theorem 1.
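As a concrete illustration of Eqs. (17)-(20), the scalar EDA log-likelihood can be evaluated with a Gauss-Laguerre approximation of Z(µ, β, φ), in the spirit of Appendix B. The sketch below is our own (function names, the node count and the limit-case thresholds are illustrative, not from the paper):

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

def beta_divergence(x, mu, beta):
    # Elementwise beta-divergence, Eq. (2), including the limit cases
    # beta -> 0 (non-normalized KL) and beta -> -1 (Itakura-Saito).
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    if abs(beta) < 1e-12:
        return np.sum(x * np.log(x / mu) - x + mu)
    if abs(beta + 1.0) < 1e-12:
        return np.sum(x / mu - np.log(x / mu) - 1.0)
    return np.sum((x ** (beta + 1) + beta * mu ** (beta + 1)
                   - (beta + 1) * x * mu ** beta) / (beta * (beta + 1)))

def log_Z(mu, beta, phi, n_nodes=100):
    # Normalizing constant of the scalar EDA density, Eq. (17), via
    # Gauss-Laguerre quadrature: int_0^inf f(t)dt = sum_k w_k e^{t_k} f(t_k).
    t, w = laggauss(n_nodes)
    log_f = ((beta - 1.0) / 2.0) * np.log(t) \
        - np.array([beta_divergence(tk, mu, beta) for tk in t]) / phi
    return np.log(np.sum(w * np.exp(t + log_f)))

def eda_loglik(x, mu, beta, phi):
    # EDA log-likelihood, Eq. (19): the divergence and the augmentation
    # term are separable over entries, so the per-entry terms simply add up.
    x, mu = np.atleast_1d(x), np.atleast_1d(mu)
    return float(sum(((beta - 1.0) / 2.0) * np.log(xi)
                     - beta_divergence(xi, mi, beta) / phi
                     - log_Z(mi, beta, phi)
                     for xi, mi in zip(x, mu)))
```

At β = −1 and φ = 1, for instance, the EDA density reduces to an exponential density with a closed-form normalizer, which gives a convenient check of the quadrature.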

We will empirically illustrate the discrepancy between ED and EDA in Section IV-A, showing that the selection based on ED is, however, inaccurate, especially for β ≤ 0.

B. Selecting α-divergence

We extend the MEDAL method to α-divergence selection. This is done by relating the α-divergence to the β-divergence with a nonlinear transformation between α and β. Let y_i = x_i^α / α^(2α), m_i = µ_i^α / α^(2α) and β = 1/α − 1 for α ≠ 0. We have

    D_β(y_i || m_i) = [ y_i^(β+1) + β m_i^(β+1) − (β + 1) y_i m_i^β ] / ( β(β + 1) )
                    = [ x_i^α µ_i^(1−α) − α x_i + (α − 1) µ_i ] / ( α(α − 1) )
                    = D_α(x_i || µ_i),

since y_i^(β+1) = x_i / α², m_i^(β+1) = µ_i / α², y_i m_i^β = x_i^α µ_i^(1−α) / α², and β(β + 1) = (1 − α) / α². This relationship allows us to evaluate the likelihood of µ and α using y_i and β:

    p(x_i; µ_i, α, φ) = p(y_i; m_i, β, φ) |dy_i / dx_i|
                      = p(y_i; m_i, β, φ) x_i^(α−1) / α^(2(α−1/2))
                      = p(y_i; m_i, β, φ) y_i^(−β) |β + 1|.

In vectorial form, the best α for D_α(x || µ) is then given by α* = 1/(β* + 1), where

    β* = argmax_β { max_φ [ ln p(y; m, β, φ) − β Σ_i ln y_i + n ln|β + 1| ] },   (22)

where m = argmin_η D_β(y || η). This transformation method can handle all α except α → 0, since it corresponds to β → ∞.

C. Selecting γ- and Rényi divergences

Above we presented the selection methods for two families where the divergence is separable over the tensor entries. Next we consider selection among the γ- and Rényi divergence families, whose members are not separable. Our strategy is to reduce a γ-divergence to the corresponding β-divergence with a connecting scalar. This is formally given by the following result.

Theorem 2: For x ≥ 0 and τ ∈ R,

    argmin_{µ ≥ 0} D_{γ→τ}(x || µ) = argmin_{µ ≥ 0} [ min_{c > 0} D_{β→τ}(x || cµ) ].   (23)

The proof is done by zeroing the derivative of the right-hand side with respect to c (details in Appendix C). Theorem 2 states that, with a positive scalar, the learning problem formulated by a γ-divergence is equivalent to the one by the corresponding β-divergence. The latter is separable and can be solved by the methods described in Section III-A. An example is between the normalized KL-divergence (in the γ family) and the non-normalized KL-divergence (in the β family), with the optimal connecting scalar c = Σ_i x_i / Σ_i µ_i. Example applications on selecting the best γ-divergence are given in Section IV-D.

Similarly, we can also reduce a Rényi divergence to its corresponding α-divergence with the same proof technique (see Appendix C).

Theorem 3: For x ≥ 0 and τ > 0,

    argmin_{µ ≥ 0} D_{ρ→τ}(x || µ) = argmin_{µ ≥ 0} [ min_{c > 0} D_{α→τ}(x || cµ) ].   (24)

IV. EXPERIMENTS

In this section we demonstrate the proposed method on various data types and learning tasks. First we provide the results on synthetic data, whose density is known, to compare the behavior of MTL, MEDAL and the score matching method [29]. Second, we illustrate the advantage of the EDA density over ED. Third, we apply our method to α- and β-divergence selection in Nonnegative Matrix Factorization (NMF) on real-world data including music and stock prices. Fourth, we test MEDAL in selecting non-separable cases (e.g., the γ-divergence) for Projective NMF and s-SNE visualization learning tasks across synthetic data, images, and a dolphin social network.

A. Synthetic data

1) β-divergence selection: We use here scalar data generated from the four special cases of Tweedie distributions, namely the Inverse Gaussian, Gamma, Poisson, and Gaussian distributions. We simply fit the best Tweedie, EDA or ED density to the data using either the maximum likelihood method or score matching (SM).

In Fig. 1 (first row), the results of the Maximum Tweedie Likelihood (MTL) are shown. The β value that maximizes the likelihood in the Tweedie distribution is consistent with the true parameters, i.e., −2, −1, 0 and 1, respectively, for the above distributions. Note that Tweedie distributions are not defined for β ∈ (0, 1), but the β-divergence is defined in this region, which will lead to a discontinuity in the log-likelihood over β.

The second and third rows in Fig. 1 present results of the exponential divergence density ED given in Eq. (21). The log-likelihood and negative score matching objectives [29] on the same four datasets are shown. The estimates are consistent with the ground truth for the Gaussian and Poisson data. However, for the Gamma and Inverse Gaussian data, both β estimates deviate from the ground truth. Thus, estimators based on ED do not give as accurate estimates as the MTL method. The ED distribution [29] has the advantage that it is defined also for β ∈ (0, 1). In the above, we have seen that β selection using ED is accurate when β → 0 or β = 1. However, as explained in Section III-A2, in the other cases the ED and Tweedie distributions are not the same, because the terms containing the observed variable are not exactly those of the Tweedie distributions.

EDA, the augmented ED density introduced in Section III-A, not only has the advantage of continuity but also gives very accurate estimates for β < 0. The MEDAL log-likelihood curves over β based on EDA are given in Fig. 1 (fourth row). In the β selection of Eq. (20), the φ value that maximizes the likelihood with β fixed is found by a grid search. The likelihood values are the same as those of the special Tweedie distributions, and there are no abrupt changes or discontinuities in the likelihood surface.
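The grid-search selection just described can be sketched end-to-end as follows. This is a simplified illustration rather than the authors' implementation: the coarse grid fit for µ, the φ grid and the use of SciPy's `quad` for Z(µ, β, φ) are our own choices:

```python
import numpy as np
from scipy.integrate import quad

def d_beta(x, mu, beta):
    # Scalar beta-divergence, Eq. (2), with the beta -> 0 and beta -> -1 limits.
    if abs(beta) < 1e-12:
        return x * np.log(x / mu) - x + mu
    if abs(beta + 1.0) < 1e-12:
        return x / mu - np.log(x / mu) - 1.0
    return (x ** (beta + 1) + beta * mu ** (beta + 1)
            - (beta + 1) * x * mu ** beta) / (beta * (beta + 1))

def medal_select_beta(x, betas, phis):
    # Grid version of Eq. (20): for each candidate beta, mu is fitted as the
    # minimizer of the total beta-divergence on a coarse grid, phi is
    # maximized over a grid, and Z(mu, beta, phi) is integrated numerically
    # (split at the mode t = mu for robustness).
    best_ll, best_beta = -np.inf, None
    mu_grid = np.linspace(x.min(), x.max(), 200)
    for beta in betas:
        cost = [sum(d_beta(xi, m, beta) for xi in x) for m in mu_grid]
        mu = mu_grid[int(np.argmin(cost))]
        for phi in phis:
            f = lambda t: t ** ((beta - 1) / 2) * np.exp(-d_beta(t, mu, beta) / phi)
            Z = quad(f, 0, mu, limit=200)[0] + quad(f, mu, np.inf, limit=200)[0]
            ll = sum((beta - 1) / 2 * np.log(xi) - d_beta(xi, mu, beta) / phi
                     - np.log(Z) for xi in x)
            if ll > best_ll:
                best_ll, best_beta = ll, beta
    return best_beta
```

On data drawn from a Gaussian density, for example, such a scan should pick β = 1 out of the candidates {−1, 0, 1}, in line with the Tweedie correspondence above.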

[Fig. 1 here: five rows by four columns of curves over β ∈ [−2, 2], for data generated from the Inverse Gaussian (IN), Gamma (G), Poisson (PO) and Gaussian (N) distributions. Best β per column (IN, G, PO, N): Tweedie log-likelihood −2, −1, −0.08, 1; ED log-likelihood −1.32, −0.5, −0.08, 0.94; negative SM objective of ED 2, −0.63, −0.09, 0.92; EDA log-likelihood −2, −1.02, −0.08, 0.94; negative SM objective of EDA −2, −1, −0.09, 0.92.]

Fig. 1. β selection using (from top to bottom) Tweedie likelihood, ED likelihood, negative SM objective of ED, EDA likelihood, and negative SM objective of EDA. Data were generated using Tweedie distributions with β = −2, −1, 0, 1 (from left to right).

We also estimated β for the EDA density using Score Matching, and curves of the negative SM objective are presented in the bottom row of Fig. 1. They also recover the ground truth accurately.

2) α-divergence selection: There is only one known generative model for which the maximum likelihood estimator corresponds to the minimizer of the corresponding α-divergence: the Poisson distribution. We thus reused the Poisson-distributed data of the previous experiments with the β-divergence. In Fig. 2a, we present the log-likelihood objective over α obtained with the Tweedie distribution (MTL) and the transformation from Section III-B. The ground truth α → 1 is successfully recovered with MTL. However, there are no likelihood estimates for α ∈ (0.5, 1), corresponding to β ∈ (0, 1), for which no Tweedie distributions are defined. Moreover, to our knowledge there are no studies concerning the pdfs of Tweedie distributions with β > 1. For that reason, the likelihood values for α ∈ [0, 0.5) are left blank in the plot.

It can be seen from Fig. 2b and 2c that the augmentation in the MEDAL method also helps in α selection. Again, both ED and EDA solve most of the discontinuity problem except at α = 0. Selection using ED fails to find the ground truth, which equals 1, but it is successfully found by the MEDAL method.
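All of the α scans above go through the reduction of Section III-B: transform the data, score with the β machinery, and apply the Jacobian correction of Eq. (22). A minimal sketch for α > 0 (the `beta_loglik` argument is a placeholder for whichever β scorer, Tweedie, ED or EDA, is being evaluated):

```python
import numpy as np

def alpha_loglik_via_beta(x, alpha, beta_loglik):
    # Section III-B reduction (for alpha > 0): y_i = x_i^alpha / alpha^(2*alpha),
    # beta = 1/alpha - 1, then reuse any beta-divergence likelihood.
    # `beta_loglik(y, beta)` is a caller-supplied scorer (placeholder here).
    y = x ** alpha / alpha ** (2 * alpha)
    beta = 1.0 / alpha - 1.0
    # Change of variables, Eq. (22):
    # ln p(x) = ln p(y) - beta * sum(ln y) + n * ln|beta + 1|
    return beta_loglik(y, beta) - beta * np.sum(np.log(y)) \
        + x.size * np.log(abs(beta + 1.0))
```

The correction term is exactly the log-Jacobian Σ_i ln|dy_i/dx_i| = −β Σ_i ln y_i + n ln|β + 1|, so any proper β density yields a proper α likelihood.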

[Fig. 2 here: objective curves over α ∈ [−2, 2] on the Poisson data; best α = 1.01 (a, Tweedie), 0.25 (b, ED), 0.97 (c, EDA), 0.87 (d, negative SM objective of EDA).]

Fig. 2. Log-likelihood of (a) Tweedie, (b) ED, and (c) EDA distributions for α-selection. In the Tweedie plot, blanks correspond to β = 1/α − 1 values for which a Tweedie distribution pdf does not exist or cannot be evaluated, i.e., β ∈ (0, 1) ∪ (1, ∞). In (d), negative SM objective function values are plotted for EDA.

SM on EDA recovers the ground truth as well (Fig. 2d).

B. Divergence selection in NMF


The objective in nonnegative matrix factorization (NMF) is to find a low-rank approximation to the observed data by expressing it as a product of two nonnegative matrices, i.e., V ≈ V̂ = WH with V ∈ R_+^{F×N}, W ∈ R_+^{F×K} and H ∈ R_+^{K×N}. This objective is pursued through the minimization of an information divergence between the data and the approximation, i.e., D(V ‖ V̂). The divergence can be any appropriate one for the data/application, such as β, α, γ, Rényi, etc. Here we chose the β- and α-divergences to illustrate the MEDAL method for realistic data.

The optimization of β-NMF was implemented using the standard multiplicative update rules [23], [38]. Similar multiplicative update rules are also available for α-NMF [23]. Alternatively, the algorithm for β-NMF can be used for α-divergence minimization as well, using the transformation explained in Section III-B.

1) A Short Piano Excerpt: We consider the piano data used in [21]. It is an audio sequence recorded in real conditions, consisting of four notes played all together in the first measure and in all possible pairs in the subsequent measures. A power spectrogram with an analysis window of size 46 ms was computed, leading to F = 513 frequency bins and N = 676 time frames. These make up the data matrix V, for which a matrix factorization V̂ = WH with low rank K = 6 is sought.

Fig. 3. (a, b) Log-likelihood values for β and α for the spectrogram of a short piano excerpt with F = 513, N = 676, K = 6 (best β = −1, best α = −0.2). (c) Negative SM objective for β (best β = −1).

In Fig. 3a and 3b, we show the log-likelihood values of the MEDAL method for β and α, respectively. For each parameter value β and α, the multiplicative algorithm for the respective divergence is run for 100 iterations, and likelihoods are evaluated with mean values calculated from the returned matrix factorizations. For each value of β and α, the highest likelihood w.r.t. φ (see Eq. (20)) is found by a grid search.

The found maximum likelihood estimate β = −1 corresponds to the Itakura-Saito divergence, which is in harmony with the empirical results presented in [21] and the common belief that the IS divergence is most suitable for audio spectrograms. The optimal α value was 0.5, corresponding to the Hellinger distance. We can also see that the log-likelihood value associated with α = 0.5 is still much less than the one for β = −1. SM also finds β = −1, as can be seen from Fig. 3c.

C. Stock Prices

Next, we repeat the same on a stock price dataset which contains the Dow Jones Industrial Average. There are 30 companies included in the data. They are major American companies from various sectors, such as services (e.g., Walmart), consumer goods (e.g., General Motors) and healthcare (e.g., Pfizer). The data was collected from 3rd January 2000 to 27th July 2011, in total 2543 trading dates. We set K = 5 in NMF and masked 50% of the data by following [39]. The stock data curves are displayed in Fig. 4 (top).

The EDA likelihood curve with β ∈ [−2, 2] is shown in Figure 4 (bottom left). We can see that the best divergence selected by MEDAL is β = 0.4. The corresponding best φ = 0.006. These results are in harmony with the findings of Tan and Févotte [39] using the remaining 50% of the data as a validation set, where they found that β ∈ [0, 0.5] (mind that our β values equal theirs minus one) performs
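For concreteness, the standard multiplicative updates used in the experiments above can be sketched in a few lines. The following is our own minimal illustration (the function name, the random initialization, and the small `eps` guard are ours, not the authors' code), written in this paper's convention where β = −1 gives Itakura-Saito, β = 0 gives KL, and β = 1 gives the squared Euclidean distance:

```python
import numpy as np

def beta_nmf(V, K, beta, n_iter=100, seed=0, eps=1e-12):
    """Multiplicative-update NMF minimizing the beta-divergence D_beta(V || WH).

    Sketch in this paper's convention (beta = -1: Itakura-Saito,
    beta = 0: KL, beta = 1: squared Euclidean); the update rules are the
    standard ones of [23], [38], shifted by one in the beta index.
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.uniform(0.5, 1.5, size=(F, K))
    H = rng.uniform(0.5, 1.5, size=(K, N))
    for _ in range(n_iter):
        # H <- H * [W^T (V o Vh^(beta-1))] / [W^T Vh^beta], Vh = WH
        Vh = W @ H + eps
        H *= (W.T @ (V * Vh ** (beta - 1))) / (W.T @ Vh ** beta + eps)
        # Symmetric update for W
        Vh = W @ H + eps
        W *= ((V * Vh ** (beta - 1)) @ H.T) / (Vh ** beta @ H.T + eps)
    return W, H
```

Each update multiplies the current factor by a ratio of nonnegative terms, so nonnegativity is preserved automatically; running this for a grid of β values is the inner loop on which the likelihood evaluation described above is built.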

well for a large range of φ's. Differently, our method is more advantageous because we do not need additional criteria nor data for validation. In Figure 4 (bottom right), the negative SM objective function is plotted for β ∈ [−2, 2]. With SM, the optimal β is found to be 1.

Fig. 4. Top: the stock data (daily prices of 30 companies over 2543 trading days). Bottom left: the EDA log-likelihood for β ∈ [−2, 2] (best β = 0.4). Bottom right: negative SM objective function for β ∈ [−2, 2] (best β = 1).

D. Selecting γ-divergence

In this section we demonstrate that the proposed method can be applied to applications beyond NMF and to non-separable divergence families. To our knowledge, no other existing methods can handle these two cases.

1) Multinomial data: We first exemplify γ-divergence selection for synthetic data drawn from a multinomial distribution. We generated a 1000-dimensional stochastic vector p from the uniform distribution. Next we drew x ∼ Multinomial(n, p) with n = 10^7. The MEDAL method is applied to find the best γ-divergence for the approximation of x by p.

Fig. 6 (1st row, left) shows the MEDAL log-likelihood. The peak appears when γ = 0, which indicates that the normalized KL-divergence is the most suitable one among the γ-divergence family. Selection using score matching of EDA gives the best γ also close to zero (best γ = 0.08; Fig. 6, 1st row, right). The result is expected, because the maximum likelihood estimator of p in the multinomial distribution is equivalent to minimizing the KL-divergence over p. Our finding also justifies the usage of KL-divergence in topic models with the multinomial distribution [40], [7].

2) Projective NMF: Next we apply the MEDAL method to Projective Nonnegative Matrix Factorization (PNMF) [27], [28] based on the γ-divergence [13], [19]. Given a nonnegative matrix V ∈ R_+^{F×N}, PNMF seeks a low-rank nonnegative matrix W ∈ R_+^{F×K} (K < F) that minimizes D_γ(V ‖ V̂), where V̂ = WW^T V. PNMF is able to produce a highly orthogonal W and thus finds its applications in part-based feature extraction and clustering analysis, etc. Different from conventional NMF (or linear NMF), where each factorizing matrix appears only once in the approximation, the matrix W occurs twice in V̂. Thus it is a special case of Quadratic Nonnegative Matrix Factorization (QNMF) [41]. We choose PNMF for two reasons: 1) we demonstrate the MEDAL performance on QNMF besides the linear NMF already shown in Section IV-B; 2) PNMF contains only one variable matrix in learning, without the issue of how to interleave the updates of different variable matrices.

We first tested MEDAL on a synthetic dataset. We generated a diagonal blockwise data matrix V of size 50×30, where the two blocks are of sizes 30×20 and 20×10. The block entries are uniformly drawn from [0, 10]. We then added uniform noise from [0, 1] to all matrix entries. For each γ, we ran the multiplicative algorithm of PNMF by Yang and Oja [28], [42] to obtain W and V̂. The MEDAL method was then applied to select the best γ. The resulting approximated log-likelihood for γ ∈ [−2, 2] is shown in Fig. 6 (2nd row). We can see that MEDAL and score matching of EDA give similar results, where the best γ appears at −0.76 and −0.8, respectively. Both resulting W's give perfect clustering accuracy of the data rows.

We also tested MEDAL on the swimmer dataset [43], which is popularly used in the NMF field. Some example images from this dataset are shown in Fig. 5 (top). We vectorized each image in the dataset as a column and concatenated the columns into a 1024×256 data matrix V. This matrix is then fed to PNMF and MEDAL as in the case of the synthetic dataset. Here we empirically set the rank to K = 17 according to Tan and Févotte [44] and Yang et al. [45]. The matrix W was initialized by PNMF based on Euclidean distance to avoid poor local minima. The resulting approximated log-likelihood for γ ∈ [−1, 3] is shown in Figure 6 (3rd row, left). We can see a peak appearing around 1.7. Zooming in the region near the peak shows the best γ = 1.69. The score matching objective over γ values (Fig. 6, 3rd row, right) shows a similar peak and a best γ (1.63) very close to the one given by MEDAL. Both methods result in excellent and nearly identical basis matrices (W) of the data, where the swimmer body as well as the four limbs at four angles are clearly identified (see Fig. 5, bottom row).

3) Symmetric Stochastic Neighbor Embedding: Finally, we show an application beyond NMF, where MEDAL is used to find the best γ-divergence for visualization using Symmetric Stochastic Neighbor Embedding (s-SNE) [5], [6].

Suppose there are n multivariate data samples {x_i}_{i=1}^n with x_i ∈ R^D, and their pairwise similarities are represented by an n×n symmetric nonnegative matrix P, where P_ii = 0 and Σ_{ij} P_ij = 1. The s-SNE visualization seeks a low-dimensional embedding Y = [y_1, y_2, ..., y_n]^T ∈ R^{n×d} such that the pairwise similarities in the embedding approximate those in the original space. Generally d = 2 or d = 3 for easy visualization. Denote q_ij = q(‖y_i − y_j‖²) with a certain kernel function q, for example q_ij = (1 + ‖y_i − y_j‖²)^{−1}. The pairwise similarities in the embedding are then given by Q_ij = q_ij / Σ_{kl:k≠l} q_kl. The s-SNE target is that Q is as close to P as possible. To measure the dissimilarity between P and Q, the conventional s-SNE uses the Kullback-Leibler divergence D_KL(P ‖ Q). Here we generalize s-SNE to the
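As a small illustration of the divergence family this section selects from, the γ-divergence between two positive vectors can be computed directly from its definition. The sketch below is our own helper (not part of the paper's code); it also makes explicit why the γ → 0 limit reduces to the KL divergence between the normalized vectors, consistent with the multinomial result above:

```python
import numpy as np

def gamma_divergence(x, mu, gamma, tol=1e-8):
    """gamma-divergence D_gamma(x || mu) between positive vectors.

    Non-separable: the entries are coupled through the log-sums.
    The gamma -> 0 limit is the KL divergence between the normalized
    vectors.  Illustrative helper; gamma = -1 is outside the domain
    of this formula.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    if abs(gamma) < tol:
        # gamma -> 0 limit: normalized KL divergence
        xt, mt = x / x.sum(), mu / mu.sum()
        return float(np.sum(xt * np.log(xt / mt)))
    num = (np.log(np.sum(x ** (1 + gamma)))
           + gamma * np.log(np.sum(mu ** (1 + gamma)))
           - (1 + gamma) * np.log(np.sum(x * mu ** gamma)))
    return float(num / (gamma * (1 + gamma)))
```

Note that the value is invariant to rescaling of the second argument, so `gamma_divergence(x, c * mu, g)` equals `gamma_divergence(x, mu, g)` for any c > 0, which is exactly the scale-invariance property that makes the γ family natural for normalized similarity matrices such as P and Q.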

whole family of γ-divergences as dissimilarity measures and select the best divergence by our MEDAL method.

We have used a real-world dolphins dataset². It is the adjacency matrix of the undirected social network between 62 dolphins. We smoothed the matrix by PageRank random walk in order to find its macro structures. The smoothed matrix was then fed to s-SNE based on the γ-divergence, with γ ∈ [−2, 2]. The EDA log-likelihood is shown in Fig. 6 (4th row, left). By the MEDAL principle, the best divergence is γ = −0.6 for s-SNE and the dolphins dataset. Score matching of EDA also indicates that the best γ is smaller than 0 (best γ = −1.04). The resulting visualizations created by s-SNE with the respective best γ-divergence are shown in Fig. 7, where the node layouts by both methods are very similar. In both visualizations we can clearly see two dolphin communities.

Fig. 5. Swimmer dataset: (top) example images; (bottom) the best PNMF basis (W) selected by using (bottom left) MEDAL and (bottom right) score matching of EDA. The visualization reshapes each column of W to an image and displays it by the Matlab function imagesc.

Fig. 6. Selecting the best γ-divergence: (1st row) for multinomial data, (2nd row) in PNMF for synthetic data, (3rd row) in PNMF for the swimmer dataset, and (4th row) in s-SNE for the dolphins dataset; (left column) using MEDAL and (right column) using score matching of EDA. The red star highlights the peak, and the small subfigures in each plot show the zoom-in around the peak. The subfigures in the 3rd row zoom in the area near the peaks.

V. CONCLUSIONS

We have presented a new method called MEDAL to automatically select the best information divergence in a parametric family. Our selection method is built upon a statistical learning approach, where the divergence is learned as the result of standard density parameter estimation. Maximizing the likelihood of the Tweedie distribution is a straightforward way for selecting β-divergence, which however has some shortcomings. We have proposed a novel distribution, the Exponential Divergence with Augmentation (EDA), which overcomes these shortcomings and thus can give a more robust selection for the parameter over a wider range. The new method has been extended to α-divergence selection by a nonlinear transformation. Furthermore, we have provided new results that connect the γ- and β-divergences, which enable us to extend the selection method to non-separable cases. The extension also holds for the Rényi divergence, with a similar relationship to the α-divergence. As a result, our method can be applied to most commonly used information divergences in learning.

We have performed extensive experiments to show the accuracy and applicability of the new method. Comparison on synthetic data has illustrated that our method is superior to Maximum Tweedie Likelihood: it finds the ground truth as accurately as MTL, while being defined on all values of β and being less prone to numerical problems (no abrupt changes in the likelihood). We also showed that a previous estimation

²available at http://www-personal.umich.edu/~mejn/netdata/

approach by Score Matching on the Exponential Divergence distribution (ED, i.e., EDA before augmentation) is not accurate. We have applied MEDAL to NMF, Projective NMF, and visualization by s-SNE. In those cases where the correct parameter value is known in advance for the synthetic data, or there is a wide consensus in the application community on the correct parameter value for real-world data, the MEDAL method gives the expected results. These results show that the presented method has not only broad applications but also accurate selection performance. In the case of new kinds of data, for which the appropriate information divergence is not known, the MEDAL method provides a disciplined and rigorous way to compute the optimal parameter values.

In this paper we have focused on information divergences for vectorial data. There exist other divergences for higher-order tensors, for example, the LogDet divergence and the von Neumann divergence (see e.g. [47]) that are defined over eigenvalues of matrices. Selection among these divergences remains an open problem.

Here we mainly consider a positive data matrix and selecting the divergence parameter in (−∞, +∞). The Tweedie distribution has no support for zero entries when β < 0 and thus gives zero likelihood for the whole matrix/tensor by independence. In future work, an extension of EDA to accommodate nonnegative data matrices could be developed for β ≥ 0.

MEDAL is a two-phase method: the β selection is based on the optimization result of µ. Ideally, both variables should be selected by optimizing the same objective. For the maximum log-likelihood estimator, this requires that the negative log-likelihood equals the β-divergence, which is however infeasible for all β due to intractability of the integrals. Non-ML estimators could be used to attack this open problem.

The EDA distribution family includes the exact Gaussian, Gamma, and Inverse Gaussian distributions, and an approximated Poisson distribution. In the approximation we used the first-order Stirling expansion. One could apply higher-order expansions to improve the approximation accuracy. This could be implemented by further augmentation with higher-order terms around β → 0.

VI. ACKNOWLEDGMENT

This work was financially supported by the Academy of Finland (Finnish Center of Excellence in Computational Inference Research COIN, grant no 251170; Zhirong Yang additionally by decision number 140398).

Fig. 7. Visualization of the dolphins social network with the best γ using (top) MEDAL and (bottom) score matching of EDA. Dolphins and their social connections are shown by circles and lines, respectively. The background illustrates the node density by the Parzen method [46].

APPENDIX A
INFINITE SERIES EXPANSION IN TWEEDIE DISTRIBUTION

In the series expansion, an EDM random variable is represented as a sum of G independent Gamma random variables x = Σ_{g=1}^{G} y_g, where G is Poisson distributed with parameter λ = µ^{2−p}/(φ(2−p)); and the shape and scale parameters of the Gamma distribution are a and b, with a = (2−p)/(1−p) and b = φ(p−1)µ^{p−1}. The pdf of the Tweedie distribution is obtained analytically at x = 0 as exp(−µ^{2−p}/(φ(2−p))). For x > 0, f(x, φ, p) = (1/x) Σ_{j=1}^{∞} W_j(x, φ, p), where for 1 < p < 2

$$W_j = \frac{1}{\pi}\,\frac{\Gamma(1+ja)\,\phi^{j(a-1)}\,(p-1)^{ja}}{\Gamma(1+j)\,(p-1)^{j}\,x^{ja}}\,(-1)^j \sin(-\pi j a). \qquad (26)$$

This infinite summation needs approximation in practice. Dunn and Smyth [35] described an approach to select a subset of these infinite terms to accurately approximate f(x, φ, p). In their approach, Stirling's approximation of the Gamma functions is used to find the index j which gives the highest value of the function. Then, in order to find the most significant region, the indices are progressed in both directions until negligible terms are reached.

APPENDIX B
GAUSS-LAGUERRE QUADRATURES

This method (e.g. [48]) can evaluate definite integrals of the form

$$\int_0^{\infty} e^{-z} f(z)\, dz \approx \sum_{i=1}^{n} f(z_i)\, w_i, \qquad (27)$$

where z_i is the ith root of the n-th order Laguerre polynomial L_n(z), and the weights are given by

$$w_i = \frac{z_i}{(n+1)^2 \left[L_{n+1}(z_i)\right]^2}. \qquad (28)$$

The recursive definition of L_n(z) is given by

$$L_{n+1}(z) = \frac{1}{n+1}\left[(2n+1-z)\,L_n(z) - n\,L_{n-1}(z)\right], \qquad (29)$$
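The quadrature rule (27) does not have to be built from the recursion (29) by hand. As a sketch (our own example, with a much smaller n than the n = 5000 used in the experiments), NumPy exposes the Gauss-Laguerre nodes and weights directly:

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

# Nodes z_i (roots of the n-th Laguerre polynomial) and weights w_i
# for the rule (27); n = 50 here for illustration.
n = 50
z, w = laggauss(n)

# The rule is exact for polynomial integrands up to degree 2n - 1,
# e.g. the third moment: int_0^inf e^{-z} z^3 dz = Gamma(4) = 6.
approx = float(np.sum(w * z ** 3))
```

For the smooth, rapidly decaying integrands arising from the EDA likelihood, a moderate n already gives high accuracy; the large n = 5000 of the experiments is a conservative choice.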

with L_0(z) = 1 and L_1(z) = 1 − z. In our experiments, we used the Matlab implementation by Winckel³ with n = 5000.

APPENDIX C
PROOFS OF THEOREMS 2 AND 3

Lemma 4: arg min_z a f(z) = arg min_z a ln f(z) for a ∈ R and f(z) > 0.

The proof of the lemma is simply by the monotonicity of ln.

Next we prove Theorem 2. For β ∈ R \ {−1, 0}, zeroing ∂D_β(x‖cµ)/∂c gives

$$c^* = \frac{\sum_i x_i \mu_i^{\beta}}{\sum_i \mu_i^{1+\beta}}. \qquad (30)$$

Putting it back to min_µ min_c D_β(x‖cµ), we obtain

$$\min_c D_\beta(x\|c\mu) = \frac{1}{\beta(1+\beta)}\left[\sum_i x_i^{1+\beta} - \frac{\left(\sum_i x_i \mu_i^{\beta}\right)^{1+\beta}}{\left(\sum_j \mu_j^{1+\beta}\right)^{\beta}}\right].$$

Dropping the constant Σ_i x_i^{1+β}, and by Lemma 4, the above is equivalent to minimizing

$$\frac{1}{\beta(1+\beta)}\left[\beta \ln \sum_j \mu_j^{1+\beta} - (1+\beta)\ln \sum_i x_i \mu_i^{\beta}\right].$$

Adding the constant (1/(β(1+β))) ln Σ_i x_i^{1+β}, the objective becomes minimizing the γ-divergence (replacing β with γ; see Eq. (7)).

The proofs for the special cases are similar; the main steps are given below.

• β = γ → 0 (or α = ρ → 1): zeroing ∂D_{β→0}(x‖cµ)/∂c gives c* = Σ_i x_i / Σ_i µ_i. Putting it back, we obtain D_{β→0}(x‖c*µ) = (Σ_i x_i) D_{γ→0}(x‖µ).

• β = γ → −1: zeroing ∂D_{β→−1}(x‖cµ)/∂c gives c* = (1/M) Σ_i x_i/µ_i, where M is the length of x. Putting it back, we obtain D_{β→−1}(x‖c*µ) = M D_{γ→−1}(x‖µ).

• α = ρ → 0: zeroing ∂D_{α→0}(x‖cµ)/∂c gives

$$c^* = \exp\left(\sum_i \frac{\mu_i}{\sum_j \mu_j}\ln\frac{x_i}{\mu_i}\right).$$

Putting it back, we obtain

$$D_{\alpha\to 0}(x\|c^*\mu) = -\exp\left(-\sum_i \tilde\mu_i \ln\frac{\tilde\mu_i}{x_i}\right) + \sum_i x_i,$$

where µ̃_i = µ_i/Σ_j µ_j. Dropping the constant Σ_i x_i, minimizing D_{α→0}(x‖c*µ) is equivalent to minimization of Σ_i µ̃_i ln(µ̃_i/x_i). Adding the constant ln Σ_j x_j to the latter, the objective becomes identical to D_{ρ→0}(x‖µ), i.e., D_KL(µ̃‖x̃).

We can apply a similar technique to prove Theorem 3. For α ∈ R \ {0, 1}, zeroing ∂D_α(x‖cµ)/∂c gives

$$c^* = \left(\frac{\sum_i x_i^{\alpha}\mu_i^{1-\alpha}}{\sum_i \mu_i}\right)^{1/\alpha}. \qquad (31)$$

Putting it back, we obtain

$$D_\alpha(x\|c^*\mu) = \frac{1}{\alpha(1-\alpha)}\sum_i\left[\alpha x_i + (1-\alpha)\left(\frac{\sum_j x_j^{\alpha}\mu_j^{1-\alpha}}{\sum_j \mu_j}\right)^{1/\alpha}\mu_i - \left(\frac{\sum_j x_j^{\alpha}\mu_j^{1-\alpha}}{\sum_j \mu_j}\right)^{(1-\alpha)/\alpha} x_i^{\alpha}\mu_i^{1-\alpha}\right] \qquad (32)$$

which, after collecting the last two terms, simplifies to

$$= \frac{1}{\alpha-1}\left[\sum_i x_i^{\alpha}\left(\frac{\mu_i}{\sum_j \mu_j}\right)^{1-\alpha}\right]^{1/\alpha} + \frac{\sum_i x_i}{1-\alpha}. \qquad (35)$$

Dropping the constant Σ_i x_i/(1−α), and by Lemma 4, minimizing the above is equivalent to minimization of (for α > 0)

$$\frac{1}{\alpha-1}\ln\left[\sum_i x_i^{\alpha}\left(\frac{\mu_i}{\sum_j \mu_j}\right)^{1-\alpha}\right]. \qquad (36)$$

Adding the constant (α/(1−α)) ln Σ_i x_i to the above, the objective becomes minimizing the Rényi-divergence (replacing α with ρ; see Eq. (9)).

REFERENCES

[1] R. Kompass, "A generalized divergence measure for nonnegative matrix factorization," Neural Computation, vol. 19, no. 3, pp. 780–791, 2006.
[2] I. S. Dhillon and S. Sra, "Generalized nonnegative matrix approximations with Bregman divergences," in Advances in Neural Information Processing Systems, vol. 18, 2006, pp. 283–290.
[3] A. Cichocki, H. Lee, Y.-D. Kim, and S. Choi, "Non-negative matrix factorization with α-divergence," Pattern Recognition Letters, vol. 29, pp. 1433–1440, 2008.
[4] Z. Yang and E. Oja, "Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1878–1891, 2011.
[5] G. Hinton and S. Roweis, "Stochastic neighbor embedding," in Advances in Neural Information Processing Systems, 2002, pp. 833–840.
[6] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[7] D. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[8] I. Sato and H. Nakagawa, "Rethinking collapsed variational Bayes inference for LDA," in International Conference on Machine Learning (ICML), 2012.
[9] T. Minka, "Divergence measures and message passing," Microsoft Research, Tech. Rep., 2005.
[10] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations," The Annals of Mathematical Statistics, vol. 23, pp. 493–507, 1952.
[11] S. Amari, Differential-Geometrical Methods in Statistics. Springer Verlag, 1985.
[12] A. Basu, I. R. Harris, N. Hjort, and M.
Jones, "Robust and efficient estimation by minimising a density power divergence," Biometrika, vol. 85, pp. 549–559, 1998.

³available at http://www.mathworks.se/matlabcentral/fileexchange/

[13] H. Fujisawa and S. Eguchi, "Robust parameter estimation with a small bias against heavy contamination," Journal of Multivariate Analysis, vol. 99, pp. 2053–2081, 2008.
[14] A. Cichocki and S.-I. Amari, "Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities," Entropy, vol. 12, no. 6, pp. 1532–1568, 2010.
[15] A. Cichocki, S. Cruces, and S.-I. Amari, "Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization," Entropy, vol. 13, pp. 134–170, 2011.
[16] I. Csiszár, "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten," Publications of the Mathematical Institute of Hungarian Academy of Sciences, Series A, vol. 8, pp. 85–108, 1963.
[17] T. Morimoto, "Markov processes and the H-theorem," Journal of the Physical Society of Japan, vol. 18, no. 3, pp. 328–331, 1963.
[18] L. M. Bregman, "The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3, pp. 200–217, 1967.
[19] Z. Yang, H. Zhang, Z. Yuan, and E. Oja, "Kullback-Leibler divergence for nonnegative matrix factorization," in Proceedings of the 21st International Conference on Artificial Neural Networks, 2011, pp. 14–17.
[20] Z. Yang and E. Oja, "Projective nonnegative matrix factorization with α-divergence," in Proceedings of the 19th International Conference on Artificial Neural Networks, 2009, pp. 20–29.
[21] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[22] B. Jørgensen, "Exponential dispersion models," Journal of the Royal Statistical Society, Series B (Methodological), vol. 49, no. 2, pp. 127–162, 1987.
[23] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorization. John Wiley and Sons, 2009.
[24] Y. K. Yilmaz and A. T. Cemgil, "Alpha/beta divergences and Tweedie models," CoRR, vol. abs/1209.4280, 2012.
[25] A. Hyvärinen, "Estimation of non-normalized statistical models using score matching," Journal of Machine Learning Research, vol. 6, pp. 695–709, 2005.
[26] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.
[27] Z. Yuan and E. Oja, "Projective nonnegative matrix factorization for image compression and feature extraction," in Proceedings of the 14th Scandinavian Conference on Image Analysis, 2005, pp. 333–342.
[28] Z. Yang and E. Oja, "Linear and nonlinear projective nonnegative matrix factorization," IEEE Transactions on Neural Networks, vol. 21, no. 5, pp. 734–749, 2010.
[29] Z. Lu, Z. Yang, and E. Oja, "Selecting β-divergence for nonnegative matrix factorization by score matching," in Proceedings of the 22nd International Conference on Artificial Neural Networks (ICANN 2012), 2012, pp. 419–426.
[30] S. Eguchi and Y. Kano, "Robustifing maximum likelihood estimation," Institute of Statistical Mathematics, Tokyo, Tech. Rep., 2001.
[31] M. Minami and S. Eguchi, "Robust blind source separation by beta divergence," Neural Computation, vol. 14, pp. 1859–1886, 2002.
[32] A. Rényi, "On measures of information and entropy," in Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, 1960, pp. 547–561.
[33] M. Mollah, S. Eguchi, and M. Minami, "Robust prewhitening for ICA by minimizing beta-divergence and its application to FastICA," Neural Processing Letters, vol. 25, pp. 91–110, 2007.
[34] H. Choi, S. Choi, A. Katake, and Y. Choe, "Learning alpha-integration with partially-labeled data," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010, pp. 14–19.
[35] P. K. Dunn and G. K. Smyth, "Series evaluation of Tweedie exponential dispersion model densities," Statistics and Computing, vol. 15, no. 4, pp. 267–280, 2005.
[36] ——, "Tweedie family densities: methods of evaluation," in Proceedings of the 16th International Workshop on Statistical Modelling, 2001.
[37] A. Hyvärinen, "Some extensions of score matching," Computational Statistics & Data Analysis, vol. 51, no. 5, pp. 2499–2512, 2007.
[38] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the beta-divergence," Neural Computation, vol. 23, no. 9, 2011.
[39] V. Y. F. Tan and C. Févotte, "Automatic relevance determination in nonnegative matrix factorization with the β-divergence," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, accepted, to appear.
[40] T. Hofmann, "Probabilistic latent semantic indexing," in International Conference on Research and Development in Information Retrieval (SIGIR), 1999, pp. 50–57.
[41] Z. Yang and E. Oja, "Quadratic nonnegative matrix factorization," Pattern Recognition, vol. 45, no. 4, pp. 1500–1510, 2012.
[42] ——, "Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization," IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1878–1891, 2011.
[43] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in Advances in Neural Information Processing Systems 16, 2003, pp. 1141–1148.
[44] V. Y. F. Tan and C. Févotte, "Automatic relevance determination in nonnegative matrix factorization," in Proceedings of the 2009 Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS'09), 2009.
[45] Z. Yang, Z. Zhu, and E. Oja, "Automatic rank determination in projective nonnegative matrix factorization," in Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation (LVA2010), 2010, pp. 514–521.
[46] E. Parzen, "On estimation of a probability density function and mode," The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.
[47] B. Kulis, M. A. Sustik, and I. S. Dhillon, "Low-rank kernel learning with Bregman matrix divergences," Journal of Machine Learning Research, vol. 10, pp. 341–376, 2009.
[48] M. Abramowitz and I. A. Stegun, Eds., Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th ed. New York: Dover, 1972, p. 890.