Publication

Clustering of ordinal data

Bibliographic details
Abstract:	Several clustering methods have been developed for continuous multivariate data. However, these methods are often applied to ordinal data as if the data had metric properties, thereby overlooking their ordinal nature. This study focuses on model-based clustering methods built on parameterized finite Gaussian mixture models, which use Expectation-Maximization (EM) algorithms to identify clusters. Using several selected datasets, it investigates the recovery ability of these methods when the ratio scale of the data, if not originally ordinal, is changed to an ordinal scale. The EM clustering method considered for ordinal data was the pairwise likelihood approach developed by Ranalli and Rocci (2016), which was used to estimate the parameters of the mixture model. The standard URV (Underlying Response Variable) approach assumes that the ordinal variables are generated by discretizing underlying multivariate normal variables. In this work, an extension of the URV approach was applied by taking the underlying variables to follow a mixture of multivariate normal distributions. The pairwise EM algorithm was compared with a model-based clustering (EM algorithm) method, the Mclust function in R. When the data were originally ordinal, the ordinal probit model was used to generate the underlying continuous data, assumed to be normally distributed, before applying the URV method. Three datasets were considered. The first is the Iris dataset, with 150 observations and 4 numerical variables. The second is a public dataset on maternal health with 1014 observations and 7 measurements, 6 continuous and 1 ordinal. The third concerns the risk of dementia among patients with HIV/AIDS and consists of 255 observations with 4 ordinal variables. The continuous variables in the Iris and maternal health risk data were discretized so that they could be analysed as ordinal data. 
Since the underlying latent mixture is known, the thresholds used to discretize the latent variables in the pairwise EM algorithm were those that maximized the adjusted Rand index (ARI). The URV extension method recovered the cluster structure of the Iris data with an ARI of 0.922 and that of the maternal health data with an ARI of 0.3. Applying Mclust to the datasets gave ARIs of 0.90 and 0.15 for the Iris and maternal health datasets, respectively. For the dementia data, a 4-component model was selected as the best model. The URV extension method can recover the cluster structure of the data even when it is applied to incomplete data and to data that do not completely represent the original data (because of discretization). The three applications show that changing the measurement scale from continuous (original data) to ordinal (via discretization) using the URV method can enhance the recovery ability of model-based clustering (EM algorithm) methods.
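The two generic steps mentioned in the abstract, discretizing a continuous variable with thresholds and scoring a recovered partition with the ARI, can be sketched as follows. This is only an illustration under stated assumptions and not the dissertation's code: the function names `discretize` and `adjusted_rand_index`, the sample values, and the thresholds are hypothetical, the boundary convention (a value equal to a threshold falls in the lower category) is an assumption, and neither the pairwise EM algorithm nor Mclust is implemented here.

```python
from bisect import bisect_left
from collections import Counter
from math import comb


def discretize(values, thresholds):
    """Map continuous values to ordinal categories 1..len(thresholds)+1.

    Assumed convention: category c holds values v with
    thresholds[c-2] < v <= thresholds[c-1] (after sorting the thresholds).
    """
    cuts = sorted(thresholds)
    return [bisect_left(cuts, v) + 1 for v in values]


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same observations."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement

    # Contingency table cells and its row/column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical measurements and thresholds, chosen only for illustration.
    ordinal = discretize([0.2, 1.4, 2.9, 3.1, 5.0], thresholds=[1.0, 3.0])
    print(ordinal)
    # The ARI ignores how clusters are labelled, only the grouping matters.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

An ARI of 1 indicates identical partitions, while values near 0 indicate agreement no better than chance, which gives a scale for reading the ARI values reported in the abstract.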
Main authors:	Ikechukwu, Blessing Ukamaka
Subject:	Clustering; EM algorithm; Ordinal data; URV approach
Year:	2024
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade de Aveiro
Language:	English
Source:	RIA - Repositório Institucional da Universidade de Aveiro
