Publication

GPU-Accelerated Approximate K-Nearest Neighbors over Unbounded Datasets

Bibliographic details
Abstract:	With the growing demand for computational power in the information era, algorithms like K-Nearest Neighbours (KNN), a lazy learning algorithm, have been developed for applications such as similarity search, clustering and classification, which make it possible to make assumptions about large sets of data. Since the time complexity of most KNN algorithms is bounded below by the size of the input data, processing becomes too slow for almost all applications when dealing with unbounded datasets, such as those expected from an input data stream. The current state of the art offers many Approximate Nearest Neighbours (ANN) algorithms which, instead of computing the exact result like standard KNN implementations, approximate it. This allows systems to trade some output accuracy for major performance gains and, when coupled with techniques like Locality Sensitive Hashing (LSH), these algorithms also try to optimize memory usage in Graphics Processing Unit (GPU)-based implementations. However, with stream-based inputs and growing datasets, it is easy to run into heap memory problems and inefficiencies when using the currently available algorithms. In this work we propose SPLASH, a system that implements an ANN algorithm supporting not only stream-based inputs but also datasets that may not fit in the GPU’s memory. With the added dataset-building functionality, we developed an algorithm that scales by growing its sample data while maintaining a constant query processing time as the dataset grows. By meeting these requirements, we developed an implementation that advances the state of the art. Test results against a baseline ANN implementation showed that SPLASH is more efficient in long-term executions, with good accuracy and precision in its outputs. The system also proved stable when stress-tested in a stream-based context, maintaining a constant query processing time while effectively growing the dataset.
Main authors:	Lopes, Gonçalo Pedro Santos
Subject:	KNN Big Datasets Data Streams LSH ANN GPU Memory Management
Year:	2021
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Source:	Repositório Institucional da UNL
