Publication
Categorical data clustering using a minimum message length criterion
| Abstract: | Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters are done simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length (MML) criterion to select the number of clusters (Wallace and Bolton, 1986). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the approach of Figueiredo and Jain (2002). The novelty of the approach rests on the integration of model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number from a set of pre-estimated candidate models. The performance of our approach is compared with that of the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The results illustrate the capacity of the proposed algorithm to recover the true number of clusters while running faster than the BIC- and ICL-based approaches, which is especially relevant when dealing with large data sets. |
|---|---|
| Main authors: | Silvestre, Cláudia |
| Other authors: | Cardoso, Margarida; Figueiredo, Mário |
| Subject: | Cluster analysis; Categorical data; Expectation-maximization algorithm; MML - Minimum Message Length - criterion |
| Year: | 2012 |
| Country: | Portugal |
| Document type: | conference object |
| Access type: | open access |
| Associated institution: | Instituto Politécnico de Lisboa |
| Language: | English |
| Source: | Repositório Científico do Instituto Politécnico de Lisboa |
| _version_ | 1866887642246807552 |
|---|---|
| author | Silvestre, Cláudia |
| author2 | Cardoso, Margarida Figueiredo, Mário |
| author2_role | author author |
| author_facet | Silvestre, Cláudia Cardoso, Margarida Figueiredo, Mário |
| author_role | author |
| contributor_name_str_mv | RCIPL |
| country_str | PT |
| creators_json_txt | [{\"Person.name\":\"Silvestre, Cláudia\",\"Person.identifier.orcid\":\"0000-0002-8850-4304\"},{\"Person.name\":\"Cardoso, Margarida\"},{\"Person.name\":\"Figueiredo, Mário\"}] |
| datacite.contributors.contributor.contributorName.fl_str_mv | RCIPL |
| datacite.creators.creator.creatorName.fl_str_mv | Silvestre, Cláudia Cardoso, Margarida Figueiredo, Mário |
| datacite.date.Accepted.fl_str_mv | 2012-10-01T00:00:00Z |
| datacite.date.available.fl_str_mv | 2014-12-12T12:22:05Z |
| datacite.date.embargoed.fl_str_mv | 2014-12-12T12:22:05Z |
| datacite.rights.fl_str_mv | http://purl.org/coar/access_right/c_abf2 |
| datacite.subjects.subject.fl_str_mv | Cluster analysis Categorical data Expectation-maximization algorithm MML - Minimum Message Length - criterion |
| datacite.titles.title.fl_str_mv | Categorical data clustering using a minimum message length criterion |
| dc.contributor.none.fl_str_mv | RCIPL |
| dc.creator.none.fl_str_mv | Silvestre, Cláudia Cardoso, Margarida Figueiredo, Mário |
| dc.date.Accepted.fl_str_mv | 2012-10-01T00:00:00Z |
| dc.date.available.fl_str_mv | 2014-12-12T12:22:05Z |
| dc.date.embargoed.fl_str_mv | 2014-12-12T12:22:05Z |
| dc.format.none.fl_str_mv | application/msword |
| dc.identifier.none.fl_str_mv | http://hdl.handle.net/10400.21/4047 |
| dc.language.none.fl_str_mv | eng |
| dc.rights.none.fl_str_mv | http://purl.org/coar/access_right/c_abf2 |
| dc.subject.none.fl_str_mv | Cluster analysis Categorical data Expectation-maximization algorithm MML - Minimum Message Length - criterion |
| dc.title.fl_str_mv | Categorical data clustering using a minimum message length criterion |
| dc.type.none.fl_str_mv | http://purl.org/coar/resource_type/c_c94f |
| dirty | 0 |
| eu_rights_str_mv | openAccess |
| format | conferenceObject |
| fulltext.url.fl_str_mv | https://repositorio.ipl.pt/bitstreams/e8ca65a9-fdcd-4f4c-8c38-4a54c56d48db/download |
| id | ripl_abceb7fe507a67371986dcd2b375c4e4 |
| identifier.url.fl_str_mv | http://hdl.handle.net/10400.21/4047 |
| instacron_str | ipl |
| institution | Instituto Politécnico de Lisboa |
| instname_str | Instituto Politécnico de Lisboa |
| language | eng |
| network_acronym_str | ripl |
| network_name_str | Repositório Científico do Instituto Politécnico de Lisboa |
| oai_identifier_str | oai:repositorio.ipl.pt:10400.21/4047 |
| organization_str_mv | urn:organizationAcronym:ipl |
| person_str_mv | Silvestre, Cláudia Silvestre, Cláudia https://www.ciencia-id.pt/DA12-EF3F-C7CD DA12-EF3F-C7CD http://orcid.org/0000-0002-8850-4304 0000-0002-8850-4304 Cardoso, Margarida Figueiredo, Mário |
| publishDate | 2012 |
| reponame_str | Repositório Científico do Instituto Politécnico de Lisboa |
| repository_id_str | urn:repositoryAcronym:ripl |
| service_str_mv | urn:repositoryAcronym:ripl |
| spelling | Categorical data clustering using a minimum message length criterion; Silvestre, Cláudia (Ciência ID: DA12-EF3F-C7CD; ORCID: 0000-0002-8850-4304; DSpace: http://dspace.org/items/08fbc1bf-3387-4137-8c03-c4664dd43375); Cardoso, Margarida; Figueiredo, Mário; hosting institution: RCIPL (rcaap@sp.ipl.pt); 2014-12-12T12:22:05Z; 2012-10; 2012-10-01T00:00:00Z; Handle: http://hdl.handle.net/10400.21/4047; open access (http://purl.org/coar/access_right/c_abf2); Cluster analysis; Categorical data; Expectation-maximization algorithm; MML - Minimum Message Length - criterion; 29696 bytes; application/msword; conference object (http://purl.org/coar/resource_type/c_c94f); fulltext: https://repositorio.ipl.pt/bitstreams/e8ca65a9-fdcd-4f4c-8c38-4a54c56d48db/download; The Eleventh International Symposium on Intelligent Data Analysis (IDA 2012), Helsinki (Finland) |
| spellingShingle | Categorical data clustering using a minimum message length criterion Silvestre, Cláudia Cluster analysis Categorical data Expectation-maximization algorithm MML - Minimum Message Length - criterion |
| status | SINGLETON |
| subject.fl_str_mv | Cluster analysis Categorical data Expectation-maximization algorithm MML - Minimum Message Length - criterion |
| title | Categorical data clustering using a minimum message length criterion |
| title_full | Categorical data clustering using a minimum message length criterion |
| title_fullStr | Categorical data clustering using a minimum message length criterion |
| title_full_unstemmed | Categorical data clustering using a minimum message length criterion |
| title_short | Categorical data clustering using a minimum message length criterion |
| title_sort | Categorical data clustering using a minimum message length criterion |
| topic | Cluster analysis Categorical data Expectation-maximization algorithm MML - Minimum Message Length - criterion |
| topic_facet | Cluster analysis Categorical data Expectation-maximization algorithm MML - Minimum Message Length - criterion |
| url | http://hdl.handle.net/10400.21/4047 |
| visible | 1 |
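
The abstract describes fitting a finite mixture of multinomial distributions with an EM-type algorithm that chooses the number of clusters during estimation. The sketch below is not the authors' implementation; it is a simplified illustration of the general idea, assuming the weight update published by Figueiredo and Jain (2002), in which half the number of free parameters per component is subtracted from each component's expected size and components that reach zero are dropped. It omits the component-wise update order and the message-length evaluation of the full method. The function names, the smoothing constant `eps`, and the synthetic data generator are all hypothetical.

```python
import math
import random


def fit_multinomial_mixture(data, n_categories, k_max, n_iter=100, seed=0):
    """Simplified EM with a Figueiredo-Jain style weight update.

    data: list of rows; each row holds one category index per variable.
    n_categories: number of categories of each variable.
    Returns (weights, thetas) for the components that survive.
    """
    rng = random.Random(seed)
    eps = 1e-6  # assumed smoothing constant, keeps every probability > 0
    # Free parameters per component: (categories - 1) for each variable.
    half_params = 0.5 * sum(c - 1 for c in n_categories)

    weights = [1.0 / k_max] * k_max
    thetas = []
    for _ in range(k_max):  # random starting probabilities
        comp = []
        for c in n_categories:
            raw = [rng.random() + 0.5 for _ in range(c)]
            comp.append([r / sum(raw) for r in raw])
        thetas.append(comp)

    for _ in range(n_iter):
        # E-step: responsibilities, computed via logs for stability.
        resp = []
        for row in data:
            logs = [
                math.log(w) + sum(math.log(th[d][x]) for d, x in enumerate(row))
                for w, th in zip(weights, thetas)
            ]
            top = max(logs)
            ex = [math.exp(v - top) for v in logs]
            resp.append([e / sum(ex) for e in ex])

        # M-step (weights): subtract half the parameter count from each
        # component's expected size; components that reach zero are dropped.
        sizes = [sum(r[k] for r in resp) for k in range(len(weights))]
        support = [max(0.0, s - half_params) for s in sizes]
        if sum(support) == 0.0:  # guard: never drop every component
            support = [1.0 if s == max(sizes) else 0.0 for s in sizes]
        keep = [k for k, s in enumerate(support) if s > 0.0]
        weights = [support[k] / sum(support) for k in keep]

        # M-step (category probabilities) for the surviving components.
        thetas = []
        for k in keep:
            comp = []
            for d, c in enumerate(n_categories):
                counts = [eps] * c
                for row, r in zip(data, resp):
                    counts[row[d]] += r[k]
                comp.append([v / sum(counts) for v in counts])
            thetas.append(comp)

    return weights, thetas


def make_synthetic(n_rows=100, seed=1):
    """Two hypothetical clusters over four 3-category variables."""
    rng = random.Random(seed)
    rows = []
    for favoured in (0, 2):
        for _ in range(n_rows):
            rows.append([favoured if rng.random() < 0.9 else rng.randrange(3)
                         for _ in range(4)])
    return rows
```

Calling `fit_multinomial_mixture(make_synthetic(), [3, 3, 3, 3], k_max=5)` starts from five components and returns the weights and category probabilities of those that remain. Because this sketch updates all components simultaneously, redundant components may shrink slowly, so the number returned after a fixed number of iterations is not guaranteed to equal the true number of clusters.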