Publicação

Detection of outliers and outliers clustering on large datasets with distributed computing

Detalhes bibliográficos
Resumo:	Outlier detection is a data analysis related problem, of great importance in diverse science fields and with many applications. Without a definitive formal definition and holding several other designations – deviations, anomalies, exceptions, noise, atypical data, – outliers are, succinctly, the samples in a dataset that, for some reason, are different from the rest of the set. It can be of interest to either remove them, as a filtering process to smoothing data, or collect them as new dataset holding additional information potentially relevant. Its importance can be seen from the broad range of applications, like fraud or intrusion detection, specialized pattern recognition, data filtering, scientific data mining, medical diagnosis, etc. Although an old problem, with roots in Statistics, the outlier detection problem has become more pertinent then ever and yet further difficult to deal with. Better and more ubiquitous ways of data acquisition and storage capacities increasing constantly, made the size of datasets grow considerably in recent years, along with its number and its availability. Larger volumes of data becomes harder to explore and filter, while simultaneously data treatment and analysis emerges as more demanded and fundamental in today’s life. Distributed computing is a computer science paradigm to distribute hard, complex problems across several independent machines, connected on a network. A problem is break down in more simple sub-problems, that are solved simultaneous by the autonomous machines, and all resultant sub-solutions collected and put together into a final solution. Distributed computing provides a solution for the limitations in the hardware scaling, both economical and physical, by building up computational capacity, as needed, with the addition of new machines, not necessarily new or advanced models, but any commodity hardware. This work presents several distributed computing algorithms to outlier detection, starting from a distributed version of an existent algorithm, CURIO[9], and introducing a series of optimizations and variants that leads to a new method, Curio3XD, that allows to resolve both the common issues typical of this problem, the constraints imposed by the size and the dimensionality of the datasets. The final version, and its variant, is applicable for any volume of data, by scaling the hardware in the distributed computing, and to high dimensionality datasets, by moving the original exponential dependency on the dimension to a dependency, quadratic, on the local density of data, easily tunable with an algorithm parameter, the precision. Intermediate versions are presented for the sake of clarification of the process that took to the final method, and as an alternative approach, possibly useful with very sparse datasets. For a distributed computing environment with full support for the distributed system and the underlying hardware infrastructure, it was chosen Apache Hadoop[23] as a platform for developing, implementation and testing, due to its power and flexibility, and yet relatively easy usability. This constitutes an open-source solution, well studied and documented, employed by several major companies, with an excellent applicability to both clouds and local clusters. The different algorithms and variants were developed within the MapReduce programing model, and implemented in the Hadoop framework, which supports that model. MapReduce was conceived to permit the deployment of distributed computing applications in a simple, developer-oriented way, with main focus on the programmatic solutions of the problems, and leaving the underneath distributed network control and maintenance absolutely transparent. The developed implementations are included in appendix. Results of tests, with an adapted real world dataset, showed very good performances of the referred algorithms’ final versions, with excellent scalability on both size and dimensionality of data, as previewed theoretically. Performance tests with the precision parameter and comparative tests between all variants developed are also presented and discussed.
Autores principais:	Pais, Rui Manuel Aleixo
Assunto:	Outlier detection MapReduce Hadoop Distributed computing Teses de mestrado - 2012
Ano:	2012
País:	Portugal
Tipo de documento:	dissertação de mestrado
Tipo de acesso:	acesso aberto
Instituição associada:	Universidade de Lisboa
Idioma:	inglês
Origem:	Repositório da Universidade de Lisboa

Descrição
Resumo:	Outlier detection is a data analysis related problem, of great importance in diverse science fields and with many applications. Without a definitive formal definition and holding several other designations – deviations, anomalies, exceptions, noise, atypical data, – outliers are, succinctly, the samples in a dataset that, for some reason, are different from the rest of the set. It can be of interest to either remove them, as a filtering process to smoothing data, or collect them as new dataset holding additional information potentially relevant. Its importance can be seen from the broad range of applications, like fraud or intrusion detection, specialized pattern recognition, data filtering, scientific data mining, medical diagnosis, etc. Although an old problem, with roots in Statistics, the outlier detection problem has become more pertinent then ever and yet further difficult to deal with. Better and more ubiquitous ways of data acquisition and storage capacities increasing constantly, made the size of datasets grow considerably in recent years, along with its number and its availability. Larger volumes of data becomes harder to explore and filter, while simultaneously data treatment and analysis emerges as more demanded and fundamental in today’s life. Distributed computing is a computer science paradigm to distribute hard, complex problems across several independent machines, connected on a network. A problem is break down in more simple sub-problems, that are solved simultaneous by the autonomous machines, and all resultant sub-solutions collected and put together into a final solution. Distributed computing provides a solution for the limitations in the hardware scaling, both economical and physical, by building up computational capacity, as needed, with the addition of new machines, not necessarily new or advanced models, but any commodity hardware. This work presents several distributed computing algorithms to outlier detection, starting from a distributed version of an existent algorithm, CURIO[9], and introducing a series of optimizations and variants that leads to a new method, Curio3XD, that allows to resolve both the common issues typical of this problem, the constraints imposed by the size and the dimensionality of the datasets. The final version, and its variant, is applicable for any volume of data, by scaling the hardware in the distributed computing, and to high dimensionality datasets, by moving the original exponential dependency on the dimension to a dependency, quadratic, on the local density of data, easily tunable with an algorithm parameter, the precision. Intermediate versions are presented for the sake of clarification of the process that took to the final method, and as an alternative approach, possibly useful with very sparse datasets. For a distributed computing environment with full support for the distributed system and the underlying hardware infrastructure, it was chosen Apache Hadoop[23] as a platform for developing, implementation and testing, due to its power and flexibility, and yet relatively easy usability. This constitutes an open-source solution, well studied and documented, employed by several major companies, with an excellent applicability to both clouds and local clusters. The different algorithms and variants were developed within the MapReduce programing model, and implemented in the Hadoop framework, which supports that model. MapReduce was conceived to permit the deployment of distributed computing applications in a simple, developer-oriented way, with main focus on the programmatic solutions of the problems, and leaving the underneath distributed network control and maintenance absolutely transparent. The developed implementations are included in appendix. Results of tests, with an adapted real world dataset, showed very good performances of the referred algorithms’ final versions, with excellent scalability on both size and dimensionality of data, as previewed theoretically. Performance tests with the precision parameter and comparative tests between all variants developed are also presented and discussed.