Document details

Outlier detection for improved clustering : empirical research for unsupervised data mining

Author(s): Madsen, Jacob Hastrup

Date: 2018

Persistent ID: http://hdl.handle.net/10362/34464

Origin: Repositório Institucional da UNL

Subject(s): Outlier Detection; Unsupervised Learning; Clustering; Data Mining


Description

Many clustering algorithms are sensitive to noise, which disturbs the results when trying to identify and characterize clusters in data. Due to the multidimensional nature of clustering, outlier detection is a complex task for which statistical approaches are not adequate. In this research work, we contend that for clustering, outliers should be perceived as observations with deviating characteristics that worsen the ratio of intra-cluster to inter-cluster distance. We present a research question that deals with improving clustering results, specifically for the two clustering algorithms k-means and hierarchical clustering, by means of outlier detection. To improve clustering results, we review the literature on outlier detection and apply 11 algorithms and 2 statistical tests to treat data prior to clustering. To evaluate the clustering results, six evaluation metrics are applied, one of which is introduced in this study. Using real-world datasets, we demonstrate that outlier detection does improve clustering results with respect to clustering objectives, but only to the extent that the data allows it. That is, if data contains ‘real’ clusters and actual outliers, proper use of outlier algorithms improves clustering significantly. Advantages and disadvantages of the outlier algorithms when dealing with different types of data are discussed, along with the properties of the evaluation metrics that describe how well clustering objectives are fulfilled. Finally, it is demonstrated that the main challenge for users in improving clustering results through outlier detection is the lack of tools to understand data structures prior to clustering. Future research on tools such as dimension reduction is encouraged, to help users avoid applying every tool in the toolbox.
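
As an illustration of the idea that an outlier worsens the ratio of intra-cluster to inter-cluster distance, the following minimal sketch compares that ratio on a small dataset with and without one deviating observation. The data points, the function names, and the exact definition of the ratio (mean point-to-centroid distance divided by mean centroid-to-centroid distance) are assumptions made for this example; they are not the metric or the data used in the thesis.

```python
from itertools import combinations
from math import dist
from statistics import mean


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(mean(axis) for axis in zip(*points))


def intra_inter_ratio(clusters):
    """Mean point-to-own-centroid distance over mean centroid-to-centroid distance.

    Lower is better: compact clusters that lie far apart.
    This definition is an illustrative assumption, not the thesis's metric.
    """
    centroids = [centroid(c) for c in clusters]
    intra = mean(dist(p, centroids[i]) for i, c in enumerate(clusters) for p in c)
    inter = mean(dist(a, b) for a, b in combinations(centroids, 2))
    return intra / inter


# Two tight, hypothetical 2-D clusters.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]

# One deviating observation that a clustering algorithm assigned to cluster A.
outlier = (5.0, -6.0)

ratio_clean = intra_inter_ratio([cluster_a, cluster_b])
ratio_noisy = intra_inter_ratio([cluster_a + [outlier], cluster_b])

print(f"ratio without outlier: {ratio_clean:.3f}")
print(f"ratio with outlier:    {ratio_noisy:.3f}")
```

In this toy case the single outlier pulls the centroid of its cluster away from the other members, so the intra-cluster term grows much faster than the inter-cluster term and the ratio rises. Removing the observation before clustering restores the lower ratio.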

Document Type: Master thesis

Language: English

Advisor(s): Costa, Ana Cristina Marinho da

Contributor(s): Madsen, Jacob Hastrup
