Publication

Model selection in discrete clustering: the EM-MML algorithm

Bibliographic details
Abstract:	Finite mixture models are widely used for cluster analysis in several areas of application. They are commonly estimated through likelihood maximization (using diverse variants of the expectation-maximization algorithm), and the number of components (or clusters) is determined by resorting to information criteria: the EM algorithm is run several times and then one of the pre-estimated candidate models is selected (e.g. using the BIC criterion). We propose a new clustering approach, the EM-MML algorithm, that deals with the clustering of categorical data (quite common in social sciences) and simultaneously identifies the number of clusters. This approach assumes that the data come from a finite mixture of multinomials and uses a variant of EM to estimate the model parameters and a minimum message length (MML) criterion to estimate the number of clusters. EM-MML thus seamlessly integrates estimation and model selection in a single algorithm. EM-MML is compared with traditional EM approaches that use alternative information criteria. Comparisons rely on synthetic datasets and also on a real dataset (data from the European Social Survey). The results obtained illustrate the parsimony of the EM-MML solutions as well as the cohesion-separation and stability of their clusters. Computation time is a further clear advantage of EM-MML.
Main author:	Silvestre, Cláudia
Other authors:	Cardoso, Margarida; Figueiredo, Mário
Subject:	Finite mixture models; EM-MML algorithm; Number of clusters
Year:	2016
Country:	Portugal
Document type:	conference paper
Access type:	restricted access
Associated institution:	Instituto Politécnico de Lisboa
Language:	English
Source:	Repositório Científico do Instituto Politécnico de Lisboa

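The conventional baseline described in the abstract (run EM once per candidate number of clusters, then select one of the fitted models with BIC) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the EM-MML algorithm and not the authors' implementation. The function names `fit_categorical_mixture` and `select_by_bic`, the integer coding of categories, the small smoothing constant, and the restart scheme are all assumptions made for the example.

```python
import numpy as np


def fit_categorical_mixture(X, n_levels, K, n_iter=200, tol=1e-8, seed=0):
    """EM for a K-component mixture of independent categorical variables.

    X is an (n, d) integer array with X[i, j] in {0, ..., n_levels[j] - 1}.
    Returns (log_likelihood, weights, thetas, responsibilities).
    """
    rng = np.random.default_rng(seed)
    # One-hot encode each variable once: a list of (n, L_j) arrays.
    onehots = [np.eye(L)[X[:, j]] for j, L in enumerate(n_levels)]
    weights = np.full(K, 1.0 / K)
    # Random initial category probabilities: one (K, L_j) table per variable.
    thetas = [rng.dirichlet(np.ones(L), size=K) for L in n_levels]

    def e_step():
        # log p(x_i, z_i = k) for every observation i and component k.
        log_joint = np.log(weights)[None, :] + sum(
            oh @ np.log(th).T for oh, th in zip(onehots, thetas)
        )
        # Log-sum-exp over components gives log p(x_i).
        row_max = log_joint.max(axis=1, keepdims=True)
        log_norm = row_max + np.log(
            np.exp(log_joint - row_max).sum(axis=1, keepdims=True)
        )
        return np.exp(log_joint - log_norm), float(log_norm.sum())

    resp, ll = e_step()
    for _ in range(n_iter):
        # M-step: update mixing weights and category probabilities.
        weights = np.clip(resp.mean(axis=0), 1e-12, None)
        weights = weights / weights.sum()
        thetas = []
        for oh in onehots:
            counts = resp.T @ oh + 1e-6  # assumed smoothing to avoid log(0)
            thetas.append(counts / counts.sum(axis=1, keepdims=True))
        resp, new_ll = e_step()
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    return ll, weights, thetas, resp


def select_by_bic(X, n_levels, candidates, n_init=3):
    """Fit one model per candidate K and return (best_K, {K: BIC})."""
    n = X.shape[0]
    bic = {}
    for K in candidates:
        # Keep the best log-likelihood over a few random restarts.
        best_ll = max(
            fit_categorical_mixture(X, n_levels, K, seed=s)[0]
            for s in range(n_init)
        )
        # Free parameters: K - 1 weights plus K * sum_j (L_j - 1) probabilities.
        n_params = (K - 1) + K * sum(L - 1 for L in n_levels)
        bic[K] = -2.0 * best_ll + n_params * np.log(n)
    return min(bic, key=bic.get), bic
```

Calling `select_by_bic(X, n_levels, candidates=[1, 2, 3, 4])` fits every candidate model before choosing one, which is the repeated-run cost that the abstract contrasts with performing estimation and model selection in a single algorithm.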