Learning the Parameters of Determinantal Point Process Kernels Raja Hafiz Affandi Department of Statistics, University of Pennsylvania Emily B. Fox Department of Statistics, University of Washington Ryan P. Adams Department of Statistics, Harvard Univerity Ben Taskar Department of Computer Science & Engineering, University of Washington RAJARA @ WHARTON . UPENN . EDU EBFOX @ STAT. WASHINGTON . EDU RPA @ SEAS . HARVARD . EDU TASKAR @ CS . WASHINGTON . EDU Abstract Determinantal point processes (DPPs) are wellsuited for modeling repulsion and have proven useful in applications where diversity is desired. While DPPs have many appealing properties, learning the parameters of a DPP is difficult, as the likelihood is non-convex and is infeasible to compute in many scenarios. Here we propose Bayesian methods for learning the DPP kernel parameters. These methods are applicable in largescale discrete and continuous DPP settings, even when the likelihood can only be bounded. We demonstrate the utility of our DPP learning methods in studying the progression of diabetic neuropathy based on the spatial distribution of nerve fibers, and in studying human perception of diversity in images. Formally, given a space Ω ⊆ Rd , a specific point configuration A ⊆ Ω, and a positive semi-definite kernel function L : Ω × Ω → R, the probability density under a DPP with kernel L is given by PL (A) ∝ det(LA ) , (1) where LA is the |A| × |A| matrix with entries L(x, y) for each x, y ∈ A. This defines a repulsive point process since point configurations that are more spread out according to the metric defined by the kernel L have higher densities. Building on work of Kulesza & Taskar (2010), it is intuitive to decompose the kernel L as L(x, y) = q(x)k(x, y)q(y) , (2) 1. Introduction A determinantal point process (DPP) is a distribution over configurations of points. The defining characteristic of the DPP is that it is repulsive, which makes it useful for modeling diversity. Recently, DPPs have played an increasingly important role in machine learning and statistics with applications both in the discrete setting—where they are used as a diverse subset selection method (Affandi et al., 2012; 2013b; Gillenwater et al., 2012; Kulesza & Taskar, 2010; 2011a; Snoek et al., 2013)— and in the continuous setting for generating point configurations that tend to be spread out(Affandi et al., 2013a; Zou & Adams, 2012). Proceedings of the 31 st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s). where q(x) can be interpreted as the quality function at point x and k(x, y) as the similarity kernel between points x and y. The ability to bias the quality in certain locations while still maintaining diversity via the similarity kernel offers great modeling flexibility. One of the remarkable aspects of DPPs is that they offer efficient algorithms for inference, including computing marginal and conditional probabilities (Kulesza & Taskar, 2012), sampling (Affandi et al., 2013a;b; Hough et al., 2006; Kulesza & Taskar, 2010), and restricting to fixed-sized point configurations (k-DPPs) (Kulesza & Taskar, 2011a). However, an important component of DPP modeling, learning the DPP kernel parameters, is still considered a difficult, open problem. Even in the discrete Ω setting, DPP kernel learning has been conjectured to be NP-hard (Kulesza & Taskar, 2012). Intuitively, the issue arises from the fact that in seeking to maximize the log-likelihood of Eq. (1), the numerator yields a concave log-determinant term and the normalizer a convex term, leading to a non-convex objective. Learning the Parameters of Determinantal Point Process Kernels This non-convexity holds even under various simplifying assumptions on the form of L. Furthermore, when Ω is either a large, discrete set or a continuous subspace, computation of the likelihood is inefficient or infeasible, respectively. This precludes the use of gradient-based and black-box optimization methods. Attempts to partially learn the kernel have been studied by, for example, learning the parametric form of the quality function q(x) for fixed similarity k(x, y) (Kulesza & Taskar, 2011b), or learning a weighting on a fixed set of kernel experts (Kulesza & Taskar, 2011a). So far, the only attempt to learn the parameters of the similarity kernel k(x, y) has used Nelder-Mead optimization (Lavancier et al., 2012), which lacks theoretical guarantees about convergence to a stationary point. Moreover, the use of Nelder-Mead (and other black-box optimization methods) relies heavily on exact computation of the likelihood. In this paper, we consider parametric forms for the quality function q(x) and similarity kernel k(x, y) and propose Bayesian methods to learn the DPP kernel parameters Θ using Markov chain Monte Carlo (MCMC). In addition to capturing posterior uncertainty rather than a single point estimate, our proposed methods apply without approximation to large-scale discrete and continuous DPPs when the likelihood can only be bounded (with any desired precision). In Sec. 2, we review DPPs and their fixed-sized counterpart (k-DPPs). We then explore maximum likelihood estimation (MLE) algorithms for learning DPP and k-DPP kernels. After examining the shortcomings of the MLE approach, we propose a set of techniques for Bayesian posterior inference of the kernel parameters in Sec. 3. In Sec. 4, we derive a set of DPP moments that can be used for model assessment, MCMC convergence diagnostics, and in low-dimensional settings for learning kernel parameters via numerical techniques. Finally, in Sec. 5 we use DPP learning to study the progression of diabetic neuropathy based on the spatial distribution of nerve fibers and also to study human perception of diversity of images. bility distribution which gives positive mass only to subsets of a fixed size, k. In these cases, we consider fixed-sized DPPs (or k-DPPs) with probability distribution on sets A of cardinality k given by k PL (A) = det(LA ) , ek (λ1 , . . . , λN ) (4) where λ1 , . . . , λN are the eigenvalues of L and ek (λ1 , . . . , λN ) is the kth elementary symmetric polynomial (Kulesza & Taskar, 2011a). Note that ek (λ1 , . . . , λN ) can be efficiently computed using recursion (Kulesza & Taskar, 2012). 2.2. Continuous DPPs/k-DPPs Consider now the case where Ω ⊆ Rd is a continuous space. DPPs extend to this case naturally, with L now a kernel operator instead of a matrix. Again appealing to Eq. (1), the DPP probability density for point configurations A ⊂ Ω is given by det(LA ) , (5) PL (A) = ∞ n=1 (λn + 1) where λ1 , λ2 , . . . are eigenvalues of the operator L. The k-DPP also extends to the continuous case with k PL (A) = det(LA ) , ek (λ1:∞ ) (6) where λ1:∞ = (λ1 , λ2 , . . .). In contrast to the discrete case, the eigenvalues λi for continuous DPP kernels are generally unknown; exceptions include a few kernels such as the Gaussian. 3. Learning Parametric DPPs Assume that we are given a training set consisting of samples A1 , A2 , . . . , AT , and that we model these data using a DPP/k-DPP with parametric kernel L(x, y; Θ) = q(x; Θ)k(x, y; Θ)q(y; Θ) , (7) 2. Background 2.1. Discrete DPPs/k-DPPs For a discrete base set Ω = {x1 , x2 , . . . , xN }, a DPP defined by an N × N positive semi-definite kernel matrix L is a probability measure on the 2N possible subsets A of Ω: PL (A) = det(LA ) . det(L + I) (3) with parameters Θ. We denote the associated kernel matrix for a set At by LAt (Θ) and the full kernel matrix/operator by L(Θ). Likewise, we denote the kernel eigenvalues by λi (Θ). In this section, we explore various methods for DPP/k-DPP learning. 3.1. Learning using Optimization Methods To learn the parameters Θ of a discrete DPP model, recalling Eq. (3) we can maximize the log-likelihood T Here, LA ≡ [Lij ]xi ,xj ∈A is the submatrix of L indexed by the elements in A and I is the N × N identity matrix (Borodin & Rains, 2005). In many applications, we are instead interested in the proba- L(Θ) = t=1 log det(LAt (Θ)) − T log det(L(Θ) + I) . Learning the Parameters of Determinantal Point Process Kernels Lavancier et al. (2012) suggest using the Nelder-Mead simplex algorithm (Nelder & Mead, 1965). This method evaluates the objective function at the vertices of a simplex, then iteratively shrinks the simplex towards an optimal point. Although straightforward, this procedure does not necessarily converge to a stationary point (McKinnon, 1998). Gradient ascent and stochastic gradient ascent are attractive due to their theoretical guarantees, but require knowledge of the gradient of L(Θ). In the discrete DPP setting, this gradient can be computed straightforwardly, and we provide examples for discrete Gaussian and polynomial kernels in the Supplement. We note, however, that both of these methods are susceptible to convergence to local optima due to the non-convex likelihood landscape. Furthermore, these methods (and many other black-box optimization techniques) require that the likelihood is known exactly. From the determinant in the denominator of Eq. (3), we see that when the number of base items N is large, computing the likelihood or its derivative is inefficient. A similar inefficiency arises when we expect large sets At , as determined by Θ. Both of these challenges limit the general applicability of these MLE approaches. Instead, in Sec. 3.3, we develop a Bayesian method that only requires an upper and lower bound on the likelihood. We focus on the large N challenge and discuss in the Supplement how analogous methods can be used for handling large observation sets, At . The log-likelihood of the k-DPP kernel parameter is T distribution over kernel parameters: P(Θ|A1 , . . . , AT ) ∝ P(Θ) for the DPP and, for the k-DPP, det(LAt (Θ)) . e (λ (Θ), . . . , λN (Θ)) t=1 k 1 (9) Here, P(Θ) is the prior on Θ. Since neither Eq. (8) nor Eq. (9) yield a closed-form posterior, we resort to approximate techniques based on Markov chain Monte Carlo (MCMC). We highlight two techniques: random-walk Metropolis-Hastings (MH) and slice sampling. We note, however, that other MCMC methods can be employed without loss of generality, and may be more efficient in some scenarios. P(Θ|A1 , . . . , AT ) ∝ P(Θ) In random-walk MH, we use a proposal distribuˆ ˆ tion f (Θ|Θi ) to generate a candidate value Θ given the current parameters Θi , which are then accepted or rejected with probability min{r, 1} where r= ˆ ˆ P(Θ|A1 , . . . , AT ) f (Θi |Θ) 1 , . . . , AT ) ˆ i) P(Θi |A f (Θ|Θ . (10) T det(LAt (Θ)) det(L(Θ) + I) t=1 T (8) ˆ The proposal distribution f (Θ|Θi ) is chosen to have mean ˆ Θi . The hyperparameters of f (Θ|Θi ) tune the width of the distribution, determining the average step size. See Alg. 1 of the Supplement. While random-walk MH can provide a straightforward means of sampling from the posterior, its efficiency requires tuning the proposal distribution. Choosing an aggressive proposal can result in a high rejection rate, while choosing a conservative proposal can result in inefficient exploration of the parameter space. To avoid the need to tune the proposal distribution, we can instead use slice sampling (Neal, 2003). We first describe this method in the univariate case, following the “linear stepping-out” approach described in Neal (2003). Given the current parameter Θi , we first sample y ∼ Uniform[0, P(Θi |A1 , . . . , AT )]. This defines our slice with all values of Θ with P(Θ|A1 , . . . , AT ) greater than y included in the slice. We then define a random interval around Θi with width w that is linearly expanded until neiˆ ther endpoint is in the slice. We propose Θ uniformly in the ˆ is in the slice, it is accepted. Otherwise, Θ ˆ interval. If Θ becomes the new boundary of the interval, shrinking it so as to still include the current state of the Markov chain. This ˆ procedure is repeated until a proposed Θ is accepted. See Alg. 2 of the Supplement. There are many ways to extend this algorithm to a multidimensional setting. We consider the simplest extension proposed by Neal (2003) where we use hyperrectangles instead L(Θ) = t=1 log det(LAt (Θ)) − T log |B|=k det(LB (Θ)) , which presents an addition complication due to needing a sum over n terms in the gradient. k For continuous DPPs/k-DPPs, once again, both MLE optimization-based methods require that the likelihood is computable. Recalling Eq. (5), we note the infinite product in the denominator. As such, for kernel operators with infinite rank (such as the Gaussian), we are forced to consider approximate MLE methods based on an explicit truncation of the eigenvalues. Gradient ascent using such truncations further relies on having a known eigendecomposition with a differentiable form for the eigenvalues. Unfortunately, such approximate gradients are not unbiased estimates of the true gradient, so the theory associated with stochastic gradient based approaches does not hold. 3.2. Bayesian Learning for Discrete DPPs Instead of optimizing the likelihood to get an MLE, we propose a Bayesian approach to estimating the posterior Learning the Parameters of Determinantal Point Process Kernels Σ11−MH Sample Autocorrelation 1 Sample Autocorrelation 1 Σ11−Slice Sampling 0.5 0.5 0 0 5 10 Lag 15 20 0 0 5 10 Lag 15 20 probability is sufficient as long as we can control the accuracy of these bounds. We denote the upper and lower bounds by P + (Θ|A1 , . . . , AT ) and P − (Θ|A1 , . . . , AT ), respectively. In the random-walk MH algorithm we can then compute the upper and lower bounds on the acceptance ratio, r+ = r− = ˆ ˆ P + (Θ|A1 , . . . , AT ) f (Θi |Θ) ˆ P − (Θi |A1 , . . . , AT ) f (Θ|Θi ) ˆ ˆ P − (Θ|A1 , . . . , AT ) f (Θi |Θ) + (Θ |A1 , . . . , AT ) ˆ P i f (Θ|Θi ) . (13) Figure 1. Sample autocorrelation function for posterior samples of the slowest mixing kernel parameter in Eq. (11) and Eq. (12), sampled using MH and slice sampling. (14) of intervals. A hyperrectangle region is constructed around Θi and the edge in each dimension is expanded or shrunk depending on whether its endpoints lie inside or outside the slice. One could alternatively consider coordinate-wise or random-direction approaches to multidimensional slice sampling. As an illustrative example, we consider synthetic data generated from a two-dimensional discrete DPP with 1 q(xi ) = exp − xi Γ−1 xi (11) 2 1 k(xi , xj ) = exp − (xi −xj ) Σ−1 (xi −xj ) , (12) 2 where Γ = diag(0.5, 0.5) and Σ = diag(0.1, 0.2). We consider Ω to be a grid of 100 points evenly spaced in a 10 × 10 unit square and simulate 100 samples from this DPP. We then condition on these simulated data and perform posterior inference of the kernel parameters using MCMC. Fig. 1 shows the sample autocorrelation function of the slowest mixing parameter, Σ11 , learned using random-walk MH and slice sampling. Furthermore, we ran a Gelman-Rubin test (Gelman & Rubin, 1992) on five chains starting from overdispersed starting positions and found that the average partial scale reduction function across the four parameters to be 1.016 for MH and 1.023 for slice sampling, indicating fast mixing of the posterior samples. 3.3. Bayesian Learning for Large-Scale Discrete and Continuous DPPs When the number of items, N , for discrete Ω is large or when Ω is continuous, evaluating the normaliz∞ ers det(L(Θ) + I) or n=1 (λn (Θ) + 1), respectively, can be inefficient or infeasible. Even in cases where an explicit form of the truncated eigenvalues can be computed, this will only lead to approximate MLE solutions, as discussed in Sec. 3.1. On the surface, it seems that most MCMC algorithms will suffer from the same problem since they require knowledge of the likelihood as well. However, we argue that for most of these algorithms, an upper and lower bound of the posterior The threshold u ∼ Uniform[0, 1] can be precomputed, so ˆ we can often accept or reject the proposal Θ even if these bounds have not completely converged. All that is necessary is for u < min{1, r− } (immediately reject) or u > min{1, r+ } (immediately accept). In the case that u ∈ (r− , r+ ), we can perform further computations to increase the accuracy of our bounds until a decision can be made. As we only sample u once in the beginning, this iterative procedure yields a Markov chain with the exact target posterior as its stationary distribution; all we have done is “short-circuit” the computation once we have bounded the acceptance ratio r away from u. We show this procedure in Alg. 3 of the Supplement. The same idea applies to slice sampling. In the first step of generating a slice, instead of sampling y ∼ Uniform[0, P(Θi |A1 , . . . , AT )], we use a rejection sampling scheme to first propose a candidate slice y ∼ Uniform[0, P + (Θi |A1 , . . . , AT )] . ˆ (15) We then decide whether y < P − (Θi |A1 , . . . , AT ), in ˆ which case we know y < P(Θi |A1 , . . . , AT ) and we ˆ accept y as the slice and set y = y . In the case ˆ ˆ where y ∈ (P − (Θi |A1 , . . . , AT ), P + (Θi |A1 , . . . , AT )), ˆ we keep increasing the tightness of our bounds until a decision can be made. If at any point y exceeds the ˆ newly computed P + (Θi |A1 , . . . , AT ), we know that y > ˆ P(Θi |A1 , . . . , AT ) so we reject the proposal. In this case, we generate a new y and repeat. ˆ Upon accepting a slice y, the subsequent steps for proposing ˆ a parameter Θ proceed in a similarly modified manner. For the interval computation, the endpoints Θe are each examined to decide whether y < P − (Θe |A1 , . . . , AT ) (endpoint is not in slice) or y > P + (Θe |A1 , . . . , AT ) (endpoint is in slice). The tightness of the posterior bounds is increased until a decision can be made and the interval adjusted, if need ˆ be. After convergence, Θ is generated uniformly over the interval and is likewise tested for acceptance. We illustrate this procedure in Fig. 1 of the Supplement. The lower and upper posterior probability bounds can be incorporated in many MCMC algorithms, and provide an Learning the Parameters of Determinantal Point Process Kernels Log of 10−DPP Normalizer Log of DPP Normalizer effective means of garnering posterior samples assuming the bounds can be efficiently tightened. For DPPs, the upper and lower bounds depend on the truncation of the kernel eigenvalues and can be arbitrarily tightened by including more terms. In the discrete DPP/k-DPP settings, the eigenvalues can be efficiently computed to a specified point using methods such as power law iterations. The corresponding bounds for a 3600 × 3600 Gaussian kernel example are shown in Fig. 2. In the continuous setting, explicit truncation can be done when the kernel has Gaussian quality and similarity, as we show in Sec. 5.1. For other continuous DPP kernels, low-rank approximations can be used (Affandi et al., 2013a) resulting in approximate posterior samples (even after convergence of the Markov chain). We believe these methods could be used to get exact posterior samples by extending the discrete-DPP Nystr¨ m theory of Affandi et al. (2013b), o but this is beyond the scope of this paper. In contrast, a gradient ascent algorithm for MLE is not even feasible: we do not know the form of the approximated eigenvalues, so we cannot take their derivative. Explicit forms for the DPP/k-DPP posterior probability bounds as a function of the eigenvalue truncations follow from Prop. 3.1 and 3.2 combined with Eqs. (8) and (9), respectively. Proofs are in the Supplement. Proposition 3.1. Let λ1:∞ be the eigenvalues of kernel L. Then M ∞ 1500 Lower Bound Upper Bound Exact Log Normalizer 120 100 80 60 40 0 Lower Bound Upper Bound Exact Log Normalizer 1000 2000 N 3000 4000 1000 500 0 0 1000 2000 N 3000 4000 Figure 2. Normalizer bounds for a discrete DPP (left) and a 10DPP (right) with Gaussian quality and similarity as in Eqs. (11) and (12) and Ω a grid of 3600 points. samples and use these as summary statistics in assessing convergence of the MCMC sampler, e.g., via Gelman-Rubin diagnostics (Gelman & Rubin, 1992). Likewise, if we observe that these posterior-sample-based moments do not cover the empirical moments of the data, this can usefully hint at a lack of posterior consistency and a potential need to revise the misspecified prior. In the discrete case, we first need to compute the marginal probabilities. Borodin (2009) shows that the marginal kernel, K, can be computed directly from L: K = L(I + L)−1 . The mth moment can then be calculated via N (18) (1 + λn ) ≤ n=1 (1 + λn ) n=1 (16) E[xm ] = i=1 xm K(xi , xi ) . i (19) and ∞ M M (1 + λn ) ≤ exp tr(L) − n=1 n=1 λn (1 + λn ) . n=1 In the continuous case, given the eigendecomposition of the ∞ kernel operator, L(x, y) = n=1 λn φn (x)∗ φn (y) (where ∗ φn (x) denotes the complex conjugate of the nth eigenfunction), the mth moment is E[xm ] = λn xm φn (x)2 dx . λn + 1 Ω n=1 ∞ Proposition 3.2. Let λ1:∞ be the eigenvalues of kernel L. Then ek (λ1:M ) ≤ ek (λ1:∞ ) (17) and k (20) ek (λ1:∞ ) ≤ j=0 (tr(L) − j! M n=1 λn )j ek−j (λ1:M ) . Note that the expression tr(L) in the bounds can be easN ily computed as either i=1 Lii in the discrete case or L(x, x)dx in the continuous case. Ω Note that Eq. (20) generally cannot be evaluated in closed form since the eigendecompositions of most kernel operators are not known. However, in certain cases, such as the Gaussian kernel of Sec. 5.1 with eigenfunctions given by Hermite polynomials, the moments can be directly computed. In the Supplement, we derive the mth moment for this Gaussian kernel setting. Unfortunately, the method of moments can be challenging to use for direct parameter learning since Eqs. (19) and (20) rarely yield analytic forms that are solvable for Θ. In low dimensions, Θ can be estimated numerically, but it is an open question to estimate these moments for large-scale problems. 4. Method of Moments In this section, we derive a set of DPP moments that can be used in a variety of ways. For example, we can compute the theoretical moments associated with each of our posterior Learning the Parameters of Determinantal Point Process Kernels 10000 1.4 1.2 5000 α 1 0.8 0 0 1000 0.8 8 0.6 500 α ρ σ 0.4 0.2 0 0 5000 10000 Slice Sampling Iterations 0 0 5000 10000 Slice Sampling Iterations 6 4 0 5000 10000 Slice Sampling Iterations 5000 10000 Slice Sampling Iterations 0 5000 10000 Slice Sampling Iterations 0.5 0 10 5000 10000 Slice Sampling Iterations −3 2.5 2 1.5 1 σ Zeroth Moment 20 2nd Moment 5000 10000 Slice Sampling Iterations 19 18 17 16 15 0 85 Zeroth Moment 80 75 70 65 0 2nd Moment 30 25 ρ 20 0 5000 10000 Slice Sampling Iterations x 10 40 30 20 10 0 0 5000 10000 Slice Sampling Iterations 5000 10000 Slice Sampling Iterations Figure 3. For a continuous DPP with Gaussian quality and similarity, from left to right: Posterior samples of α, ρ and σ, and associated zeroth and second moments. The top row are samples from Scenario (i) (blue) and Scenario (ii) (green) while the second row are samples from Scenario (iii). Red lines indicate the true parameter values that generated the data and their associated theoretical moments. The y-axis scaling aims to place all scenarios on equal footing. 5. Experiments 5.1. Simulations We provide an explicit example of Bayesian learning for a continuous DPP with the kernel defined by q(x) = √ D (ii) 1000 DPP samples with average number of points=18 using (α, ρ, σ) = (1000, 1, 1) (iii) 10 DPP samples with average number of points=77 using (α, ρ, σ) = (100, 0.7, 0.05). Fig. 3 shows trace plots of the posterior samples for all three scenarios. In the first scenario, the parameter estimates vary wildly whereas in the other two scenarios, the posterior estimates are more stable. In all cases, the zeroth and second moment estimated from the posterior samples are in the neighborhood of the corresponding empirical moments. This leads us to believe that the posterior is broad when we have both a small number of samples and few points in each sample. The posterior becomes more peaked as the total number of points increases. The stationary similarity kernel allows us to garner information either from few sets with many points or many sets of few points. Dispersion Measure In many applications, we are interested in quantifying the overdispersion of point process data. In spatial statistics, a standard dispersion measure is the Ripley K-function (Ripley, 1977). We instead aim to use the learned DPP parameters (encoding repulsion) to quantify overdispersion. Importantly, our measure should be invariant to scaling. In the Supplement we derive that, as the data are scaled from x to ηx, the parameters scale from (α, σi , ρi ) to (α, ησi , ηρi ). This suggests that an appropriate scale-invariant repulsion measure is γi = σi /ρi . 5.2. Applications 5.2.1. D IABETIC N EUROPATHY Recent breakthroughs in skin tissue imaging have spurred interest in studying the spatial patterns of nerve fibers in diabetic patients. It has been observed that these nerve fibers become more clustered as diabetes progresses. Waller et al. (2011) previously analyzed this phenomena based on 6 thigh α d=1 √ 1 exp − πρd 2ρd (xd − yd )2 2σd x2 d (21) D k(x, y) = d=1 exp − , x, y ∈ RD . (22) Here, Θ = {α, ρd , σd } and the eigenvalues of the operator L(Θ) are given by (Fasshauer & McCourt, 2012), D λm (Θ) = α d=1 1 2 βd +1 2 + 1 2γd 2 γd (βd 1 + 1) + 1 md −1 , (23) 1 d where γd = σd , βd = (1 + γ2d ) 4 , and m = (m1 , . . . , mD ) ρ is a multi-index. Furthermore, the trace of L(Θ) can be easily computed as D tr(L(Θ)) = Rd α d=1 1 x2 exp − d πρd 2ρd dx = α . (24) We test our Bayesian learning algorithms on simulated data generated from a 2-dimensional isotropic kernel (σd = σ, ρd = ρ for d = 1, 2) using Gibbs sampling (Affandi et al., 2013a). We then learn the parameters under weakly informative inverse gamma priors on σ, ρ and α. Details are in the Supplement. We consider the following simulation scenarios: (i) 10 DPP samples with average number of points=18 using (α, ρ, σ) = (1000, 1, 1) Learning the Parameters of Determinantal Point Process Kernels Normal Subject Mildly Diabetic Subject 1 Mildly Diabetic Subject 2 set of images that are not only relevant to the query, but diverse as well. Building on work by Kulesza & Taskar (2011a), three image categories—cars, dogs and cities—were studied. Within each category, 8-12 subcategories (such as Ford for cars, London for cities and poodle for dogs) were queried from Google Image Search and the top 64 results were retrieved. For a subcategory subcat, these images form our base set Ωsubcat . To assess human perception of diversity, annotated sets of size six were generated from these base sets. However, it is challenging to ask a human to coherently select six diverse images from a set of 64 total. Instead, Kulesza & Taskar (2011a) generated a partial result set of five images from a 5-DPP on each Ωsubcat with a kernel based on the SIFT256 features (see the Supplement). Human annotators (via Amazon Mechanical Turk) were then presented with two images selected at random from the remaining subcategory images and asked to add the image they felt was least similar to the partial result set. These experiments resulted in about 500 samples spread evenly across the different subcategories. We aim to study how the human annotated sets differ from the top six Google results, Top-6. As in Kulesza & Taskar (2011a), we extracted three types of features from the images—color features, SIFT descriptors (Lowe, 1999; Vedaldi & Fulkerson, 2010) and GIST descriptors (Oliva & Torralba, 2006) described in the Supplement. We denote these features for image i as ficolor , fiSIFT , and fiGIST , respectively. For each subcategory, we model our data as a discrete 6-DPP on Ωsubcat with kernel Lsubcat = exp − i,j feat feat fifeat − fj cat σfeat 2 2 Severely Diabetic Subject 2 Severely Diabetic Subject 1 Moderately Diabetic Subject Figure 4. Nerve fiber samples. Clockwise: (i) Normal subject, (ii) Mildly Diabetic Subject 1, (iii) Mildly Diabetic Subject 2,(iv) Moderately Diabetic subject, (v) Severely Diabetic Subject 1 and (vi) Severely Diabetic Subject 2. nerve fiber samples. These samples were collected from 5 diabetic patients at different stages of diabetic neuropathy and one healthy subject. On average, there are 79 points in each sample (see Fig. 4). Waller et al. (2011) analyzed the Ripley K-function and concluded that the difference between the healthy and severely diabetic samples is highly significant. We instead study the differences between these samples by learning the kernel parameters of a DPP and quantifying the level of repulsion of the point process. Due to the small sample size, we consider a 2-class study of Normal/Mildly Diabetic versus Moderately/Severely Diabetic. We perform two analyses. In the first, we directly quantify the level of repulsion based on our scale-invariant statistic, γ = σ/ρ (see Sec. 5.1). In the second, we perform a leave-one-out classification by training the parameters on the two classes with one sample left out. We then evaluate the likelihood of the held-out sample under the two learned classes. We repeat this for all six samples. We model our data using a 2-dimensional continuous DPP with Gaussian quality and similarity as in Eqs. (21) and (22). Since there is no observed preferred direction in the data, we use an isotropic kernel (σd = σ and ρd = ρ for d = 1, 2). We place weakly informative inverse gamma priors on (α, ρ, σ), as specified in the Supplement, and learn the parameters using slice sampling with eigenvalue bounds as outlined in Sec. 3.3. The results shown in Fig. 5 indicate that our γ measure clearly separates the two classes, concurring with the results of Waller et al. (2011). Furthermore, we are able to correctly classify all six samples. While the results are preliminary, being based on only six observations, they show promise for this task. 5.2.2. D IVERSITY IN I MAGES We also examine DPP learning for quantifying how visual features relate to human perception of diversity in different image categories. This is useful in applications such as image search, where it is desirable to present users with a (25) for feat ∈ {color, SIFT, GIST} and i, j indexing the 64 images in Ωsubcat . Here, we assume that each category has cat the same parameters across subcategories, namely, σfeat for subcat ∈ cat and cat ∈ {cars, dogs, cities}. To learn from the Top-6 images, we consider the samples as being generated from a 6-DPP. To emphasize the human component of the 5-DPP + human annotation sets, we examine a conditional 6-DPP (Kulesza & Taskar, 2012) that fixes the five partial results set images and only considers the probability of adding the human-annotated image. The Supplement provides details on this conditional k-DPP. All subcategory samples within a category are assumed to be independent draws from a DPP on Ωsubcat with kernel cat Lsubcat parameterized by a shared set of σfeat . As such, cat each of these samples equally informs the posterior of σfeat . We samples the posterior of the 6-DPP or conditional 6DPP kernel parameters using slice sampling with weakly cat informative inverse gamma priors on the σfeat . Details are in the Supplement. Learning the Parameters of Determinantal Point Process Kernels Normal Subject Posterior Probability Posterior Probability 170 165 160 155 Normal/Mild γ 10 Posterior Probability 5 0 Normal/Mild Moderate/Severe 115 110 105 100 95 Normal/Mild Moderate/Severe Normal/Mild Moderate/Severe Normal/Mild Moderate/Severe Moderate/Severe 160 Mildly Diabetic Subject 1 Posterior Probability 175 170 165 160 Normal/Mild 75 70 65 60 55 Moderate/Severe Mildly Diabetic Subject 2 155 x 10 15 −3 150 Normal/Mild Moderate/Severe Severely Diabetic Subject 2 Posterior Probability 100 90 80 Severely Diabetic Subject 1 Posterior Probability Cars−Color 1000 800 600 500 400 200 0 DPP/Turk Google Top 6 0 DPP/Turk 1000 Moderately Diabetic Subject Figure 5. Left: Repulsion measure, γ. Right: Leave-one-out log-likelihood of each sample in Fig. 4 (in the same order) under the two learned DPP classes: Normal/Mildly Diabetic (left box) and Moderately/Severely Diabetic (right box). cat Fig. 6 shows a comparison between σfeat learned from the human annotated samples (conditioning on the 5-DPP partial result sets) and the Top-6 samples for different categories. The results indicate that the 5-DPP + human annotated samples differs significantly from the Top-6 samples in the features judged by human to be important for diversity in each category. For cars and dogs, human annotators deem color to be a more important feature for diversity than the Google search engine based on their Top-6 results. For cities, on the other hand, the SIFT features are deemed more important by human annotators than by Google. Keep in mind, though, that this result only highlights the diversity components of the results while ignoring quality. In real life applications, it is desirable to combine both the quality of each image (as a measure of relevance of the image to the query) and the diversity between the top results. Regardless, we have shown that DPP kernel learning can be informative of judgements of diversity, and this information could be used (for example) to tune search engines to provide results more in accordance with human judgement. Cars−SIFT 1000 Cars−GIST 500 σ σ σ 0 Google Top 6 1000 DPP/Turk Google Top 6 Dogs−Color 1000 Dogs−SIFT Dogs−GIST 500 500 500 σ σ 0 DPP/Turk Google Top 6 σ 0 DPP/Turk Google Top 6 0 DPP/Turk Google Top 6 Cities−Color 1000 800 600 σ σ Cities−SIFT 1000 Cities−GIST σ 500 400 200 500 0 DPP/Turk Google Top 6 0 DPP/Turk Google Top 6 0 DPP/Turk Google Top 6 Figure 6. For the image diversity experiment, boxplots of posterior cat cat cat samples of (rom left to right) σcolor , σSIFT and σGIST . Each plot shows results for human annotated sets (left) versus Google Top 6 (right). Categories from top to bottom: (a) cars, (b) dogs and (c) cities. Acknowledgments This research was supported in part by AFOSR Grant FA9550-12-1-0453 and DARPA Young Faculty Award N66001-12-1-4219. 6. Conclusion Determinantal point processes have become increasingly popular in machine learning and statistics. While many important DPP computations are efficient, learning the parameters of a DPP kernel is difficult. This is due to the fact that not only is the likelihood function non-convex, but in many scenarios the likelihood and its gradient are either unknown or infeasible to compute. We proposed Bayesian approaches using MCMC for inferring these parameters. In addition to providing a characterization of the posterior uncertainty, these algorithms can be used to deal with large-scale discrete and continuous DPPs based solely on likelihood bounds. We demonstrated the utility of learning DPP parameters in studying diabetic neuropathy and evaluating human perception of diversity in images. Learning the Parameters of Determinantal Point Process Kernels References Affandi, R. H., Kulesza, A., and Fox, E. B. Markov determinantal point processes. In Proc. UAI, 2012. Affandi, R. H., Fox, E.B., and Taskar, B. Approximate inference in continuous determinantal processes. In Proc. NIPS, 2013a. Affandi, R.H., Kulesza, A., Fox, E.B., and Taskar, B. Nystr¨ m approximation for large-scale determinantal proo cesses. In Proc. AISTATS, 2013b. Borodin, A. Determinantal point processes. arXiv preprint arXiv:0911.1153, 2009. Borodin, A. and Rains, E.M. Eynard-Mehta theorem, Schur process, and their Pfaffian analogs. Journal of Statistical Physics, 121(3):291–317, 2005. Fasshauer, G.E. and McCourt, M.J. Stable evaluation of Gaussian radial basis function interpolants. SIAM Journal on Scientific Computing, 34(2):737–762, 2012. Gelman, A. and Rubin, D. B. Inference from iterative simulation using multiple sequences. Statistical Science, pp. 457–472, 1992. Gillenwater, J., Kulesza, A., and Taskar, B. Discovering diverse and salient threads in document collections. In Proc. EMNLP, 2012. Hough, J.B., Krishnapur, M., Peres, Y., and Vir´ g, B. Detera minantal processes and independence. Probability Surveys, 3:206–229, 2006. Kulesza, A. and Taskar, B. Structured determinantal point processes. In Proc. NIPS, 2010. Kulesza, A. and Taskar, B. k-DPPs: Fixed-size determinantal point processes. In Proc. ICML, 2011a. Kulesza, A. and Taskar, B. Learning determinantal point processes. In Proc. UAI, 2011b. Kulesza, A. and Taskar, B. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3), 2012. Lavancier, F., Møller, J., and Rubak, E. Statistical aspects of determinantal point processes. arXiv preprint arXiv:1205.4818, 2012. Lowe, D. G. Object recognition from local scale-invariant features. In Proc. IEEE International Conference on Computer Vision, 1999. McKinnon, K. I.M. Convergence of the Nelder–Mead simplex method to a nonstationary point. SIAM Journal on Optimization, 9(1):148–158, 1998. Neal, R. M. Slice sampling. Annals of Statistics, pp. 705– 741, 2003. Nelder, J. and Mead, R. A simplex method for function minimization. Computer Journal, 7(4):308–313, 1965. Oliva, A. and Torralba, A. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006. Ripley, B. D. Modelling spatial patterns. Journal of the Royal Statistical Society. Series B (Methodological), pp. 172–212, 1977. Snoek, J., Zemel, R., and Adams, R. P. A determinantal point process latent variable model for inhibition in neural spiking data. In Proc. NIPS, 2013. Vedaldi, A. and Fulkerson, B. Vlfeat: An open and portable library of computer vision algorithms. In Proc. International Conference on Multimedia, 2010. Waller, L. A., S¨ rkk¨ , A., Olsbo, V., Myllym¨ ki, M., a a a Panoutsopoulou, I.G., Kennedy, W.R., and WendelschaferCrabb, G. Second-order spatial analysis of epidermal nerve fibers. Statistics in Medicine, 30(23):2827–2841, 2011. Zou, J. and Adams, R.P. Priors for diversity in generative latent variable models. In Proc. NIPS, 2012.