Markov Determinantal Point Processes
Raja Hafiz Affandi (Department of Statistics, The Wharton School, University of Pennsylvania), Alex Kulesza (Dept. of Computer and Information Science, University of Pennsylvania), Emily B. Fox (Department of Statistics, The Wharton School, University of Pennsylvania)

Abstract

A determinantal point process (DPP) is a random process useful for modeling the combinatorial problem of subset selection. In particular, DPPs encourage a random subset Y to contain a diverse set of items selected from a base set 𝒴. For example, we might use a DPP to display a set of news headlines that are relevant to a user's interests while covering a variety of topics. Suppose, however, that we are asked to sequentially select multiple diverse sets of items, for example, displaying news headlines day-by-day. We might want these sets to be diverse not just individually but also through time, offering headlines today that are unlike the ones shown yesterday. In this paper, we construct a Markov DPP (M-DPP) that models a sequence of random sets {Y_t}. The proposed M-DPP defines a stationary process that maintains DPP margins. Crucially, the induced union process Z_t ≡ Y_t ∪ Y_{t−1} is also marginally DPP-distributed. Jointly, these properties imply that the sequence of random sets is encouraged to be diverse both at a given time step and across time steps. We describe an exact, efficient sampling procedure, and a method for incrementally learning a quality measure over items in the base set 𝒴 based on external preferences. We apply the M-DPP to the task of sequentially displaying diverse and relevant news articles to a user with topic preferences.

1 INTRODUCTION

Consider the combinatorial problem of subset selection. Binary Markov random fields are commonly applied in this setting, and in the case of positive correlations, yield subsets that favor similar items. However, in many applications there is naturally a sense of repulsion. For example, repulsive processes arise in nature: trees tend to grow in the least occupied space (Neeff et al., 2005), and ant hill locations are likewise over-dispersed relative to uniform placement (Bernstein and Gobbel, 1979). Likewise, many practical tasks can be posed in terms of diverse subset selection. For example, one might want to select a set of frames from a movie that are representative of its content. Clearly, diversity is preferable to avoid redundancy; likewise, each frame should be of high quality. A motivating example we consider throughout the paper is the task of selecting a diverse yet relevant set of news headlines to display to a user. One could imagine employing binary Markov random fields with negative correlations, but such models often involve notoriously intractable inference problems.

Determinantal point processes (DPPs), which arise in random matrix theory (Mehta and Gaudin, 1960; Ginibre, 1965) and quantum physics (Macchi, 1975), are a class of repulsive processes and are natural models for subset selection problems where diversity is preferred. DPPs define the probability of a subset in terms of the determinant of a kernel submatrix, and with an appropriate definition of the kernel matrix they can be interpreted as inherently balancing quality and diversity. DPPs are appealing in practice since they offer interpretability and tractable algorithms for exact inference. For example, one can compute marginal and conditional probabilities and perform exact sampling. DPPs have recently been employed for human pose estimation, search diversification, and document summarization (Kulesza and Taskar, 2010, 2011a,b).

In this paper, our focus is instead on modeling diverse sequences of subsets. For example, in displaying news headlines from day to day, one aims to select articles that are relevant and diverse on any given day. Additionally, it is desirable to select articles that are diverse relative to those previously shown. We construct a Markov DPP (M-DPP) for a sequence of random sets {Y_t}. The proposed M-DPP defines a stationary process that maintains DPP margins, implying that Y_t is encouraged to be diverse at time t. Crucially, the induced union process Z_t ≡ Y_t ∪ Y_{t−1} is also marginally DPP-distributed. Since this property implies the diversity of Z_t, in addition to the individual diversity of Y_t and Y_{t−1}, we conclude that Y_t is diverse from Y_{t−1}. For an illustration of the improved overall diversity when sampling from a M-DPP rather than sampling independently at each step from a DPP, see Fig. 1.

Figure 1: A set of points on a line (y axis) drawn from a DPP independently over time (left) and from a M-DPP (right). While DPP points are diverse only within time steps (columns), M-DPP points are also diverse across time steps.

Our specific construction of the M-DPP yields an exact sampling procedure that can be performed in polynomial time. Additionally, we explore a method for incrementally learning the quality of each item in the base set 𝒴 based on externally provided preferences. In particular, a decomposition of the DPP kernel matrix has an interpretation as defining the quality of each item and pairwise similarities between items. Our incremental learning procedure assumes a well-defined similarity metric and aims to learn features of items that a user deems preferable. These features are used to define the quality scores for each item. The M-DPP aids in the exploration of items of interest to the user by providing sequentially diverse results.

We study the empirical behavior of the M-DPP on a news task where the goal is to display diverse and high quality articles. Compared to choosing articles based on quality alone or to sampling from an independent DPP at each time step, we show that the M-DPP produces articles that are significantly more diverse across time steps without large sacrifices in quality. Furthermore, within a time step the M-DPP chooses articles with diversity comparable to that of the independent DPP; this is a direct result of the fact that the M-DPP maintains DPP margins. We also consider learning the quality function over time from the feedback of a user with topic preferences. In this setting, the M-DPP returns high quality results that are preferred by the user while simultaneously exploring the topic space more quickly than baseline methods, leading to improved coverage.

2 DETERMINANTAL POINT PROCESSES

A random point process P on a discrete base set 𝒴 = {1, ..., N} is a probability measure on the set 2^𝒴 of all subsets of 𝒴. Let K be a positive semidefinite matrix with rows and columns indexed by the elements of 𝒴. P is called a determinantal point process (DPP) if there exists K ⪯ I (all eigenvalues less than or equal to 1) such that if Y is a random set drawn according to P, then for every A ⊆ 𝒴:

    P(Y ⊇ A) = det(K_A).    (1)

Here, K_A ≡ [K_ij]_{i,j∈A} denotes the submatrix of K indexed by elements in A, and we adopt the convention that det(K_∅) = 1. We will refer to K as the marginal kernel. If we think of K_ij as measuring the similarity between items i and j, then

    P(Y ⊇ {i, j}) = K_ii K_jj − K_ij^2    (2)

implies that Y is unlikely to contain both i and j when they are very similar; that is, a DPP can be seen as modeling a collection of diverse items from the base set 𝒴.

DPPs can alternatively be constructed via L-ensembles (Borodin and Rains, 2005). An L-ensemble is a probability measure on 2^𝒴 defined via a positive semidefinite matrix L indexed by elements of 𝒴:

    P_L(Y = A) = det(L_A) / det(L + I),    (3)

where I is the N × N identity matrix. It can be shown that an L-ensemble is a DPP with marginal kernel K = L(I + L)^{−1}. Conversely, a DPP with marginal kernel K has L-ensemble kernel L = K(I − K)^{−1} (when the inverse exists).

An intuitive way to think of the L-ensemble kernel L is as a Gram matrix (Kulesza and Taskar, 2010):

    L_ij = q_i φ_i^T φ_j q_j,    (4)

interpreting q_i ∈ ℝ_+ as representing the intrinsic quality of an item i, and φ_i, φ_j ∈ ℝ^n as unit-length feature vectors representing the similarity between items i and j with φ_i^T φ_j ∈ [−1, 1]. Under this framework, we can model quality and similarity separately to encourage the DPP to choose high quality items that are dissimilar to each other. This is very useful in many applications. For example, in response to a search query we can provide a very relevant (i.e., high quality) but diverse (i.e., dissimilar) list of results.

Conditional DPPs  For any A, B ⊆ 𝒴 with A ∩ B = ∅, it is straightforward to show that

    P_L(Y = A ∪ B | Y ⊇ A) = det(L_{A∪B}) / det(L + I_{𝒴∖A}),    (5)

where I_{𝒴∖A} is a matrix with ones on the diagonal entries indexed by the elements of 𝒴∖A and zeros elsewhere. This conditional distribution is itself a DPP over the elements of 𝒴∖A (Borodin and Rains, 2005). In particular, suppose Y is DPP-distributed with L-ensemble kernel L, and condition on the fact that Y ⊇ A. Then the set Y∖A is DPP-distributed with marginal and L-ensemble kernels

    K^A = I − [(L + I_{𝒴∖A})^{−1}]_{𝒴∖A}    (6)
    L^A = ([(L + I_{𝒴∖A})^{−1}]_{𝒴∖A})^{−1} − I.    (7)

Here, [·]_{𝒴∖A} denotes the submatrix of the argument indexed by elements in 𝒴∖A.
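To make Eqs. (1)-(3) concrete, the following minimal numerical sketch (not from the paper; all variable names are illustrative) checks by brute-force enumeration on a tiny base set that the L-ensemble probabilities normalize, and that the marginal kernel K = L(I + L)^{−1} reproduces the marginals in Eqs. (1)-(2):

```python
# Sanity check of Eqs. (1)-(3) on a toy base set of N = 3 items.
import itertools
import numpy as np

N = 3
rng = np.random.default_rng(0)
B = rng.standard_normal((N, N))
L = B @ B.T                        # random positive semidefinite L-ensemble kernel

Z = np.linalg.det(L + np.eye(N))   # normalizer det(L + I) in Eq. (3)

def p_equal(A):
    """P_L(Y = A) = det(L_A) / det(L + I), with det(L_emptyset) = 1."""
    A = list(A)
    return np.linalg.det(L[np.ix_(A, A)]) / Z if A else 1.0 / Z

# The 2^N subset probabilities must sum to one.
subsets = [S for r in range(N + 1) for S in itertools.combinations(range(N), r)]
total = sum(p_equal(S) for S in subsets)

# Marginal kernel K = L(I + L)^{-1}; Eq. (1) gives P(Y ⊇ A) = det(K_A).
K = L @ np.linalg.inv(np.eye(N) + L)
def p_contains(A):
    return sum(p_equal(S) for S in subsets if set(A) <= set(S))

print(np.isclose(total, 1.0))                                # True
print(np.isclose(p_contains([0]), K[0, 0]))                  # Eq. (1): True
print(np.isclose(p_contains([0, 1]),
                 K[0, 0] * K[1, 1] - K[0, 1] ** 2))          # Eq. (2): True
```

The enumeration is exponential in N and serves only to verify the determinantal identities; the point of DPPs is precisely that these quantities are available in polynomial time without enumeration.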
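The quality/similarity decomposition of Eq. (4) can be sketched the same way. The snippet below (illustrative, not the paper's code) builds L from qualities q_i and unit-length feature vectors φ_i, and checks that the pair score det(L_{{i,j}}) = q_i^2 q_j^2 (1 − (φ_i^T φ_j)^2) grows with quality and shrinks with similarity:

```python
# Sketch of the Gram-matrix decomposition L_ij = q_i * phi_i^T phi_j * q_j (Eq. 4).
import numpy as np

rng = np.random.default_rng(1)
N, n = 4, 5
Phi = rng.standard_normal((n, N))
Phi /= np.linalg.norm(Phi, axis=0)    # unit-length columns: phi_i^T phi_i = 1
q = rng.uniform(0.5, 2.0, size=N)     # intrinsic qualities q_i > 0

S = Phi.T @ Phi                       # similarities phi_i^T phi_j in [-1, 1]
L = np.outer(q, q) * S                # Eq. (4), elementwise

# det(L_{ij}) = q_i^2 q_j^2 (1 - (phi_i^T phi_j)^2): high quality raises the
# probability of including the pair, high similarity lowers it.
i, j = 0, 1
det_ij = np.linalg.det(L[np.ix_([i, j], [i, j])])
print(np.isclose(det_ij, q[i]**2 * q[j]**2 * (1 - S[i, j]**2)))   # True
```

In particular, two perfectly similar items (φ_i^T φ_j = ±1) can never co-occur, regardless of their qualities, which is the balancing behavior described above.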
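Finally, the conditional formulas (5)-(7) can be verified numerically. This sketch (again illustrative: a random L, conditioning set A = {0}) checks that the conditional probabilities of Eq. (5) normalize over subsets of 𝒴∖A, and that the kernels of Eqs. (6)-(7) are mutually consistent via K^A = L^A(I + L^A)^{−1}:

```python
# Sanity check of the conditional DPP formulas, Eqs. (5)-(7).
import itertools
import numpy as np

N = 4
rng = np.random.default_rng(2)
B = rng.standard_normal((N, N))
L = B @ B.T                      # positive (semi)definite L-ensemble kernel
A = [0]                          # condition on the event Y ⊇ {0}
rest = [i for i in range(N) if i not in A]

# I_{Y\A}: ones on the diagonal entries indexed by Y \ A, zeros elsewhere.
I_rest = np.zeros((N, N))
I_rest[rest, rest] = 1.0

# Eq. (5): P_L(Y = A ∪ B | Y ⊇ A) = det(L_{A∪B}) / det(L + I_{Y\A}).
Zc = np.linalg.det(L + I_rest)
def p_cond(Bset):
    AB = sorted(set(A) | set(Bset))
    return np.linalg.det(L[np.ix_(AB, AB)]) / Zc

# Conditional probabilities over all subsets of Y \ A must sum to one.
subsets = [S for r in range(len(rest) + 1)
           for S in itertools.combinations(rest, r)]
total_cond = sum(p_cond(S) for S in subsets)
print(np.isclose(total_cond, 1.0))   # True

# Eqs. (6)-(7): kernels of the conditional DPP on Y \ A.
M = np.linalg.inv(L + I_rest)[np.ix_(rest, rest)]   # [(L + I_{Y\A})^{-1}]_{Y\A}
K_A = np.eye(len(rest)) - M                         # Eq. (6)
L_A = np.linalg.inv(M) - np.eye(len(rest))          # Eq. (7)
# Consistency with the L-ensemble relation: K^A = L^A (I + L^A)^{-1}.
print(np.allclose(K_A, L_A @ np.linalg.inv(np.eye(len(rest)) + L_A)))   # True
```

The normalization check uses the standard identity Σ_{B ⊆ 𝒴∖A} det(L_{A∪B}) = det(L + I_{𝒴∖A}), which is exactly the denominator of Eq. (5).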