A data-driven selection of the number of clusters in the Dirichlet allocation model via Bayesian mixture modelling |
| |
Authors: | E F Saraiva; C A B Pereira; A K Suzuki |
| |
Institution: | 1. Mathematics Institute, Federal University of Mato Grosso do Sul, Campo Grande, Brazil; 2. Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil; 3. Sciences Institute of Mathematics and Computers, University of São Paulo, São Carlos, Brazil |
| |
Abstract: | In this paper, we consider a Bayesian mixture model that allows us to integrate out the mixture weights, yielding a procedure in which the number of clusters is an unknown quantity. To determine the clusters and estimate the parameters of interest, we develop an MCMC algorithm called the sequential data-driven allocation sampler. In this algorithm, a single observation has a non-null probability of creating a new cluster, and a set of observations may create a new cluster through split-merge moves. The split-merge moves are developed using a sequential allocation procedure based on allocation probabilities that are calculated according to the Kullback–Leibler divergence between the posterior distribution using the observations previously allocated and the posterior distribution including a ‘new’ observation. We verify the performance of the proposed algorithm on simulated data and then illustrate its use on three publicly available real data sets. |
| |
Keywords: | Mixture model; Bayesian approach; Gibbs sampling; Metropolis–Hastings; split-merge update; Kullback–Leibler divergence |
|
|
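The Kullback–Leibler comparison described in the abstract can be illustrated with a small sketch. This is not the authors' sampler. It assumes a univariate normal likelihood with known variance and a conjugate normal prior on each cluster mean, so that both posteriors are normal and the divergence has a closed form. The function names, the prior settings, the direction of the divergence, and the rule that turns divergences into allocation probabilities (weights proportional to exp(-KL)) are all hypothetical choices made only for illustration; the abstract does not specify them.

```python
import math

def normal_posterior(obs, sigma2=1.0, mu0=0.0, tau2=10.0):
    """Posterior (mean, variance) of a normal mean with known data
    variance sigma2 and a N(mu0, tau2) prior (assumed model)."""
    n = len(obs)
    precision = 1.0 / tau2 + n / sigma2
    mean = (mu0 / tau2 + sum(obs) / sigma2) / precision
    return mean, 1.0 / precision

def kl_normal(p, q):
    """Closed-form KL(p || q) for univariate normals given as (mean, variance)."""
    (m1, v1), (m2, v2) = p, q
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def allocation_probabilities(clusters, y_new):
    """For each cluster, compare the posterior from the observations already
    allocated with the posterior that also includes y_new.
    The exp(-KL) normalisation is an illustrative assumption, not the paper's rule."""
    divergences = []
    for obs in clusters:
        before = normal_posterior(obs)
        after = normal_posterior(obs + [y_new])
        divergences.append(kl_normal(before, after))
    weights = [math.exp(-d) for d in divergences]
    total = sum(weights)
    return divergences, [w / total for w in weights]

# Two well-separated clusters and a new observation close to the second one.
clusters = [[-2.1, -1.9, -2.0], [3.0, 3.2, 2.9]]
divergences, probs = allocation_probabilities(clusters, 3.1)
print(divergences, probs)
```

Under these assumptions, a new observation that barely changes a cluster's posterior gives a small divergence and therefore a larger allocation weight for that cluster.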