Publication
Otimizações de armazenamento distribuído para aprendizagem profunda (Distributed storage optimizations for deep learning)
| Abstract: | In today's world, the use of Deep Learning (DL) is intrinsically integrated into the activity of several enterprises and industries. It allows us to extract knowledge from data, detect patterns, and make predictions, increasing the competitiveness and quality of the services provided. However, DL frameworks (e.g., TensorFlow, PyTorch, Apache MXNet) require not only considerable computational power but also efficient data storage, since they must deal with large amounts of data. In particular, in each iteration of DL model training, different batches of the training dataset are accessed to be processed and incorporated into the model. Retrieving this data can become a performance bottleneck, since datasets keep growing, reaching sizes in the order of TBs. In multi-node DL this becomes even more critical, since many compute nodes are training models, possibly with the same dataset, resulting in more requests directed at the shared file system and competing with each other. If data could be stored nearer to the compute nodes and those nodes shared the data with one another, it would reduce the I/O pressure on the shared storage system and potentially reduce the time taken by these accesses and, consequently, the training time. This thesis presents DistMonarch, a DL-framework-agnostic system that takes advantage of the storage hierarchy by copying data to levels closer to each compute node and allows the nodes to share data with each other in a transparent manner. Results show that this system reduces accesses to the shared file system by up to 90% and the training time of some models and configurations by up to 48%. |
|---|---|
| Main author: | Moreira, Maria Beatriz Cardoso Gonçalves Barbosa e |
| Subject: | I/O optimization; multi-node deep learning |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | Master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade do Minho |
| Language: | Portuguese |
| Source: | RepositóriUM - Universidade do Minho |
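The abstract describes a tiered read path: each compute node first checks its own local cache, then the caches of peer nodes, and only falls back to the shared file system on a miss everywhere. A minimal sketch of that lookup order is shown below; the class and method names (`TieredReader`, `read`) are hypothetical illustrations, not DistMonarch's actual API, whose implementation is not part of this record.

```python
class TieredReader:
    """Illustrative sketch of a tiered read path: serve dataset batches from
    a node-local cache, then from peer nodes, and only fall back to the
    shared file system on a miss everywhere. Not DistMonarch's real API."""

    def __init__(self, shared_fs, peers=()):
        self.local = {}           # batch_id -> bytes, node-local cache
        self.peers = list(peers)  # TieredReader instances on other nodes
        self.shared_fs = shared_fs
        self.shared_reads = 0     # count of accesses that reached the shared FS

    def read(self, batch_id):
        # 1. Node-local cache hit: no network or shared-FS traffic.
        if batch_id in self.local:
            return self.local[batch_id]
        # 2. Ask peer nodes that may have cached the same batch.
        for peer in self.peers:
            data = peer.local.get(batch_id)
            if data is not None:
                self.local[batch_id] = data
                return data
        # 3. Last resort: the shared file system.
        data = self.shared_fs[batch_id]
        self.shared_reads += 1
        self.local[batch_id] = data
        return data
```

Under this sketch, when two nodes train on the same dataset, the second node's reads are served entirely from the first node's cache, which is the mechanism by which the abstract's reduction in shared-file-system accesses would arise.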