Publication
Otimizações de armazenamento distribuído para aprendizagem profunda (Distributed storage optimizations for deep learning)
| Abstract: | In today's world, the use of Deep Learning (DL) is intrinsically integrated into the activity of several enterprises and industries. It allows us to extract knowledge from data, detect patterns, and make predictions, increasing the competitiveness and quality of the services provided. However, DL frameworks (e.g., TensorFlow, PyTorch, Apache MXNet) require not only considerable computational power but also efficient data storage, since they must deal with large amounts of data. In particular, in each iteration of DL model training, different batches of the training dataset are accessed to be processed and incorporated into the model. Retrieving this data can become a performance bottleneck, since datasets keep growing, reaching sizes in the order of TBs. In multi-node DL this becomes even more critical, since many compute nodes are training models, possibly with the same dataset, resulting in more requests directed at the shared file system and competing with each other. If data could be stored nearer to the compute nodes and those nodes shared the data with one another, it would reduce the I/O pressure on the shared storage system and potentially reduce the time taken by these accesses and, consequently, the training time. This thesis presents DistMonarch, a DL-framework-agnostic system that takes advantage of the storage hierarchy by copying data to levels closer to each compute node and allows the nodes to share data with each other in a transparent manner. Results show that this system reduces accesses to the shared file system by up to 90% and the training time of some models and configurations by up to 48%. |
|---|---|
| Main author: | Moreira, Maria Beatriz Cardoso Gonçalves Barbosa e |
| Subject: | I/O optimization; multi-node deep learning |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | Master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade do Minho |
| Language: | Portuguese |
| Source: | RepositóriUM - Universidade do Minho |
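The abstract describes a tiered read path: each compute node first checks its own local cache, then the caches of peer nodes, and only falls back to the shared file system on a miss everywhere. A minimal sketch of that lookup order is shown below; the class and method names (`TieredReader`, `read`) are hypothetical illustrations, not DistMonarch's actual API, whose implementation is not part of this record.

```python
class TieredReader:
    """Illustrative sketch of a tiered read path: serve dataset batches from
    a node-local cache, then from peer nodes, and only fall back to the
    shared file system on a miss everywhere. Not DistMonarch's real API."""

    def __init__(self, shared_fs, peers=()):
        self.local = {}           # batch_id -> bytes, node-local cache
        self.peers = list(peers)  # TieredReader instances on other nodes
        self.shared_fs = shared_fs
        self.shared_reads = 0     # count of accesses that reached the shared FS

    def read(self, batch_id):
        # 1. Node-local cache hit: no network or shared-FS traffic.
        if batch_id in self.local:
            return self.local[batch_id]
        # 2. Ask peer nodes that may have cached the same batch.
        for peer in self.peers:
            data = peer.local.get(batch_id)
            if data is not None:
                self.local[batch_id] = data
                return data
        # 3. Last resort: the shared file system.
        data = self.shared_fs[batch_id]
        self.shared_reads += 1
        self.local[batch_id] = data
        return data
```

Under this sketch, when two nodes train on the same dataset, the second node's reads are served entirely from the first node's cache, which is the mechanism by which the abstract's reduction in shared-file-system accesses would arise.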