Publication

Categorical data clustering using a minimum message length criterion

Bibliographic details
Abstract:	Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters are performed simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length (MML) criterion to select the number of clusters (Wallace and Bolton, 1986). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the approach of Figueiredo and Jain (2002). The novelty of the approach rests on integrating model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number from a set of pre-estimated candidate models. The performance of our approach is compared with that of the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The results illustrate the capacity of the proposed algorithm to recover the true number of clusters while running faster than the BIC- and ICL-based alternatives, which is especially relevant when dealing with large data sets.
Main author:	Silvestre, Cláudia
Other authors:	Cardoso, Margarida; Figueiredo, Mário
Subject:	Cluster analysis; Categorical data; Expectation-maximization algorithm; MML - Minimum Message Length - criterion
Year:	2012
Country:	Portugal
Document type:	conference paper
Access type:	open access
Associated institution:	Instituto Politécnico de Lisboa
Language:	English
Source:	Repositório Científico do Instituto Politécnico de Lisboa
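
The abstract describes fitting a finite mixture of multinomial distributions while letting an MML-type criterion decide how many components survive. The following is only a minimal sketch of that idea, not the authors' implementation: it assumes one categorical draw per variable, uses plain EM rather than the component-wise scheme of Figueiredo and Jain (2002), and does not evaluate the message length itself. The function name `fit_mml_mixture`, its parameters, and the smoothing constant are hypothetical choices made for illustration.

```python
import math
import random


def fit_mml_mixture(data, n_categories, k_max, n_iter=200, seed=0):
    """Sketch of EM for a mixture of categorical distributions.

    data: list of rows; each row holds one category index per variable.
    n_categories: number of categories of each variable.
    k_max: number of components to start from (deliberately too many).
    Returns (weights, theta); components with weight 0 were annihilated.
    """
    rng = random.Random(seed)
    # Free parameters per component: (c - 1) probabilities per variable.
    n_params = sum(c - 1 for c in n_categories)

    # Uniform initial weights, random (strictly positive) probabilities.
    weights = [1.0 / k_max] * k_max
    theta = []
    for _ in range(k_max):
        comp = []
        for c in n_categories:
            raw = [rng.random() + 0.5 for _ in range(c)]
            total = sum(raw)
            comp.append([r / total for r in raw])
        theta.append(comp)

    for _ in range(n_iter):
        active = [k for k in range(k_max) if weights[k] > 0]

        # E-step: posterior probability of each active component per row.
        resp = []
        for row in data:
            joint = {
                k: weights[k]
                * math.prod(theta[k][j][x] for j, x in enumerate(row))
                for k in active
            }
            total = sum(joint.values())
            resp.append({k: v / total for k, v in joint.items()})

        # M-step for the weights, with the MML-style penalty: a component
        # supported by fewer than n_params / 2 observations is dropped.
        mass = {k: sum(r[k] for r in resp) for k in active}
        trimmed = {k: max(0.0, mass[k] - n_params / 2.0) for k in active}
        norm = sum(trimmed.values())
        if norm == 0:  # assumption: stop rather than drop every component
            break
        weights = [trimmed.get(k, 0.0) / norm for k in range(k_max)]

        # M-step for the category probabilities of surviving components.
        for k in active:
            if weights[k] == 0:
                continue
            for j, c in enumerate(n_categories):
                counts = [1e-6] * c  # tiny smoothing keeps probabilities > 0
                for row, r in zip(data, resp):
                    counts[row[j]] += r[k]
                total = sum(counts)
                theta[k][j] = [v / total for v in counts]

    return weights, theta


if __name__ == "__main__":
    # Toy data: two clearly different response patterns.
    toy = [[0, 0, 0] if i % 2 == 0 else [2, 2, 2] for i in range(100)]
    w, _ = fit_mml_mixture(toy, [3, 3, 3], k_max=4)
    print("surviving components:", sum(1 for x in w if x > 0))
```

Starting from more components than expected and letting the penalised weight update remove the weakly supported ones is what allows estimation and selection to happen in one run, instead of fitting one model per candidate number of clusters and comparing them afterwards with BIC or ICL.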

_version_	1866887642246807552
author	Silvestre, Cláudia
author2	Cardoso, Margarida; Figueiredo, Mário
author2_role	author; author
author_facet	Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
author_role	author
contributor_name_str_mv	RCIPL
country_str	PT
creators_json_txt	[{"Person.name":"Silvestre, Cláudia","Person.identifier.orcid":"0000-0002-8850-4304"},{"Person.name":"Cardoso, Margarida"},{"Person.name":"Figueiredo, Mário"}]
datacite.contributors.contributor.contributorName.fl_str_mv	RCIPL
datacite.creators.creator.creatorName.fl_str_mv	Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
datacite.date.Accepted.fl_str_mv	2012-10-01T00:00:00Z
datacite.date.available.fl_str_mv	2014-12-12T12:22:05Z
datacite.date.embargoed.fl_str_mv	2014-12-12T12:22:05Z
datacite.rights.fl_str_mv	http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv	Cluster analysis; Categorical data; Expectation-maximization algorithm; MML - Minimum Message Length - criterion
datacite.titles.title.fl_str_mv	Categorical data clustering using a minimum message length criterion
dc.contributor.none.fl_str_mv	RCIPL
dc.creator.none.fl_str_mv	Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
dc.date.Accepted.fl_str_mv	2012-10-01T00:00:00Z
dc.date.available.fl_str_mv	2014-12-12T12:22:05Z
dc.date.embargoed.fl_str_mv	2014-12-12T12:22:05Z
dc.format.none.fl_str_mv	application/msword
dc.identifier.none.fl_str_mv	http://hdl.handle.net/10400.21/4047
dc.language.none.fl_str_mv	eng
dc.rights.none.fl_str_mv	http://purl.org/coar/access_right/c_abf2
dc.subject.none.fl_str_mv	Cluster analysis; Categorical data; Expectation-maximization algorithm; MML - Minimum Message Length - criterion
dc.title.fl_str_mv	Categorical data clustering using a minimum message length criterion
dc.type.none.fl_str_mv	http://purl.org/coar/resource_type/c_c94f
dirty	0
eu_rights_str_mv	openAccess
format	conferenceObject
fulltext.url.fl_str_mv	https://repositorio.ipl.pt/bitstreams/e8ca65a9-fdcd-4f4c-8c38-4a54c56d48db/download
id	ripl_abceb7fe507a67371986dcd2b375c4e4
identifier.url.fl_str_mv	http://hdl.handle.net/10400.21/4047
instacron_str	ipl
institution	Instituto Politécnico de Lisboa
instname_str	Instituto Politécnico de Lisboa
language	eng
network_acronym_str	ripl
network_name_str	Repositório Científico do Instituto Politécnico de Lisboa
oai_identifier_str	oai:repositorio.ipl.pt:10400.21/4047
organization_str_mv	urn:organizationAcronym:ipl
person_str_mv	Silvestre, Cláudia (Ciência ID: DA12-EF3F-C7CD, https://www.ciencia-id.pt/DA12-EF3F-C7CD; ORCID: 0000-0002-8850-4304, http://orcid.org/0000-0002-8850-4304); Cardoso, Margarida; Figueiredo, Mário
publishDate	2012
reponame_str	Repositório Científico do Instituto Politécnico de Lisboa
repository_id_str	urn:repositoryAcronym:ripl
service_str_mv	urn:repositoryAcronym:ripl
conference	The Eleventh International Symposium on Intelligent Data Analysis (IDA 2012), Helsinki (Finland)
url	http://hdl.handle.net/10400.21/4047
visible	1