Publication

Categorical data clustering using a minimum message length criterion


Bibliographic details
Abstract: Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters are done simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length (MML) criterion to select the number of clusters (Wallace and Bolton, 1986). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the approach of Figueiredo and Jain (2002). The novelty of the approach rests on integrating model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number from a set of pre-estimated candidate models. The performance of our approach is compared with that of the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The results illustrate the capacity of the proposed algorithm to recover the true number of clusters while running faster than the BIC- and ICL-based procedures, which is especially relevant when dealing with large data sets.
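The idea in the abstract of estimating the mixture and choosing the number of clusters in one pass can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the categorical variables are independent within each component, it updates all components at once rather than one at a time as in the original Figueiredo and Jain scheme, and the function name `fit_categorical_mixture`, its parameters, and the smoothing constant `eps` are all hypothetical. The key step is the weight update, which subtracts half the number of free parameters per component from each expected count and drops any component whose result is not positive.

```python
import math
import random


def fit_categorical_mixture(data, n_levels, k_max=5, n_iter=200, eps=1e-6, seed=0):
    """Fit a mixture of independent categorical variables by EM, pruning
    components whose support is too small (Figueiredo-Jain-style update).

    data     : list of rows; row[j] is an int in range(n_levels[j])
    n_levels : number of categories of each variable
    Returns (weights, theta) for the surviving components.
    """
    rng = random.Random(seed)
    # Free parameters per component: (levels - 1) for every variable.
    n_params = sum(c - 1 for c in n_levels)
    weights = [1.0 / k_max] * k_max
    # theta[k][j][c]: probability of level c of variable j in component k.
    theta = []
    for _ in range(k_max):
        component = []
        for c in n_levels:
            raw = [rng.random() + 0.5 for _ in range(c)]
            component.append([r / sum(raw) for r in raw])
        theta.append(component)

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each row,
        # computed in log space for numerical stability.
        resp = []
        for row in data:
            logp = [
                math.log(w) + sum(math.log(t[j][x]) for j, x in enumerate(row))
                for w, t in zip(weights, theta)
            ]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])

        # M-step (weights): subtract n_params / 2 from each expected count,
        # clip at zero, renormalise. Components that hit zero are dropped.
        counts = [sum(r[k] for r in resp) for k in range(len(weights))]
        shrunk = [max(0.0, c - n_params / 2.0) for c in counts]
        if sum(shrunk) == 0.0:
            break  # too little data for any component; keep the last model
        keep = [k for k, s in enumerate(shrunk) if s > 0.0]

        # M-step (category probabilities) for the surviving components;
        # eps is a small smoothing constant (an assumption) that avoids log(0).
        new_theta = []
        for k in keep:
            component = []
            for j, c in enumerate(n_levels):
                level_counts = [eps] * c
                for row, r in zip(data, resp):
                    level_counts[row[j]] += r[k]
                norm = sum(level_counts)
                component.append([v / norm for v in level_counts])
            new_theta.append(component)
        theta = new_theta
        weights = [shrunk[k] / sum(shrunk) for k in keep]

    return weights, theta
```

Calling `fit_categorical_mixture(rows, n_levels=[3, 3, 3, 3], k_max=5)` starts from five components and returns only those that survive the pruning, so the number of clusters is read off as `len(weights)` instead of being chosen afterwards by comparing separately fitted models with BIC or ICL.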
Main authors: Silvestre, Cláudia
Other authors: Cardoso, Margarida; Figueiredo, Mário
Subject: Cluster analysis; Categorical data; Expectation-maximization algorithm; MML - Minimum Message Length - criterion
Year: 2012
Country: Portugal
Document type: conference object
Access type: open access
Associated institution: Instituto Politécnico de Lisboa
Language: English
Source: Repositório Científico do Instituto Politécnico de Lisboa
_version_ 1866887642246807552
author Silvestre, Cláudia
author2 Cardoso, Margarida
Figueiredo, Mário
author2_role author
author
author_facet Silvestre, Cláudia
Cardoso, Margarida
Figueiredo, Mário
author_role author
contributor_name_str_mv RCIPL
country_str PT
creators_json_txt [{\"Person.name\":\"Silvestre, Cláudia\",\"Person.identifier.orcid\":\"0000-0002-8850-4304\"},{\"Person.name\":\"Cardoso, Margarida\"},{\"Person.name\":\"Figueiredo, Mário\"}]
datacite.contributors.contributor.contributorName.fl_str_mv RCIPL
datacite.creators.creator.creatorName.fl_str_mv Silvestre, Cláudia
Cardoso, Margarida
Figueiredo, Mário
datacite.date.Accepted.fl_str_mv 2012-10-01T00:00:00Z
datacite.date.available.fl_str_mv 2014-12-12T12:22:05Z
datacite.date.embargoed.fl_str_mv 2014-12-12T12:22:05Z
datacite.rights.fl_str_mv http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv Cluster analysis
Categorical data
Expectation-maximization algorithm
MML - Minimum Message Length - criterion
datacite.titles.title.fl_str_mv Categorical data clustering using a minimum message length criterion
dc.contributor.none.fl_str_mv RCIPL
dc.creator.none.fl_str_mv Silvestre, Cláudia
Cardoso, Margarida
Figueiredo, Mário
dc.date.Accepted.fl_str_mv 2012-10-01T00:00:00Z
dc.date.available.fl_str_mv 2014-12-12T12:22:05Z
dc.date.embargoed.fl_str_mv 2014-12-12T12:22:05Z
dc.format.none.fl_str_mv application/msword
dc.identifier.none.fl_str_mv http://hdl.handle.net/10400.21/4047
dc.language.none.fl_str_mv eng
dc.rights.none.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.subject.none.fl_str_mv Cluster analysis
Categorical data
Expectation-maximization algorithm
MML - Minimum Message Length - criterion
dc.title.fl_str_mv Categorical data clustering using a minimum message length criterion
dc.type.none.fl_str_mv http://purl.org/coar/resource_type/c_c94f
dirty 0
eu_rights_str_mv openAccess
format conferenceObject
fulltext.url.fl_str_mv https://repositorio.ipl.pt/bitstreams/e8ca65a9-fdcd-4f4c-8c38-4a54c56d48db/download
id ripl_abceb7fe507a67371986dcd2b375c4e4
identifier.url.fl_str_mv http://hdl.handle.net/10400.21/4047
instacron_str ipl
institution Instituto Politécnico de Lisboa
instname_str Instituto Politécnico de Lisboa
language eng
network_acronym_str ripl
network_name_str Repositório Científico do Instituto Politécnico de Lisboa
oai_identifier_str oai:repositorio.ipl.pt:10400.21/4047
organization_str_mv urn:organizationAcronym:ipl
person_str_mv Silvestre, Cláudia
Silvestre, Cláudia
https://www.ciencia-id.pt/DA12-EF3F-C7CD
DA12-EF3F-C7CD
http://orcid.org/0000-0002-8850-4304
0000-0002-8850-4304
Cardoso, Margarida
Figueiredo, Mário
publishDate 2012
reponame_str Repositório Científico do Instituto Politécnico de Lisboa
repository_id_str urn:repositoryAcronym:ripl
service_str_mv urn:repositoryAcronym:ripl
spelling Categorical data clustering using a minimum message length criterion. Silvestre, Cláudia (Ciência ID DA12-EF3F-C7CD; ORCID 0000-0002-8850-4304); Cardoso, Margarida; Figueiredo, Mário. Hosting institution: RCIPL (rcaap@sp.ipl.pt). The Eleventh International Symposium on Intelligent Data Analysis (IDA 2012), Helsinki (Finland). Date: 2012-10. Available: 2014-12-12T12:22:05Z. Conference object; open access; application/msword; 29696 bytes. Subjects: Cluster analysis; Categorical data; Expectation-maximization algorithm; MML - Minimum Message Length - criterion. DSpace item: http://dspace.org/items/08fbc1bf-3387-4137-8c03-c4664dd43375. Handle: http://hdl.handle.net/10400.21/4047. Full text: https://repositorio.ipl.pt/bitstreams/e8ca65a9-fdcd-4f4c-8c38-4a54c56d48db/download
spellingShingle Categorical data clustering using a minimum message length criterion
Silvestre, Cláudia
Cluster analysis
Categorical data
Expectation-maximization algorithm
MML - Minimum Message Length - criterion
status SINGLETON
subject.fl_str_mv Cluster analysis
Categorical data
Expectation-maximization algorithm
MML - Minimum Message Length - criterion
title Categorical data clustering using a minimum message length criterion
title_full Categorical data clustering using a minimum message length criterion
title_fullStr Categorical data clustering using a minimum message length criterion
title_full_unstemmed Categorical data clustering using a minimum message length criterion
title_short Categorical data clustering using a minimum message length criterion
title_sort Categorical data clustering using a minimum message length criterion
topic Cluster analysis
Categorical data
Expectation-maximization algorithm
MML - Minimum Message Length - criterion
topic_facet Cluster analysis
Categorical data
Expectation-maximization algorithm
MML - Minimum Message Length - criterion
url http://hdl.handle.net/10400.21/4047
visible 1