Publication
Distributed AI training platform
| Abstract: | Training large-scale artificial intelligence models has become a critical challenge in modern research, requiring distributed infrastructures capable of efficiently coordinating multiple devices. This dissertation presents a comparative analysis of three distributed deep learning training platforms: PyTorch Distributed Data Parallel (DDP), Apache Spark, and Determined AI, evaluating their performance, resource management capabilities, and usability in organizational environments. The methodology involved implementing and testing each framework on a three-node cluster equipped with NVIDIA GPUs, using the BERT-tiny model for sentiment classification on the IMDB dataset. Quantitative metrics (training time, model accuracy, and scaling efficiency) were collected and complemented by a qualitative evaluation of configuration complexity, orchestration features, and developer experience. Results show that PyTorch DDP offers the best absolute performance, completing 20 epochs of training in 499 seconds with 2 GPUs, while Determined AI introduces a 21% overhead but provides superior cluster management capabilities, including automatic scheduling, experiment tracking, and fault tolerance. Apache Spark incurs a significant overhead (187%) but integrates naturally into existing data processing pipelines. Framework selection depends on context: DDP is ideal for individual researchers prioritizing speed, Determined AI suits shared environments requiring reproducibility and centralized management, and Spark serves scenarios where training is integrated into broader big data workflows. |
|---|---|
| Main authors: | Cerqueiro, Tiago Andrés |
| Subject: | Distributed training; Deep learning; Machine learning; Parallel computing |
| Year: | 2025 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Instituto Politécnico de Bragança |
| Language: | English |
| Source: | Biblioteca Digital do IPB |
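
The overhead percentages quoted in the abstract can be read as relative increases in training time over a baseline run. The following is a minimal sketch of that arithmetic, not a result from the dissertation: it assumes that both the 21% and 187% figures are measured against the 499-second PyTorch DDP run on 2 GPUs, which the abstract does not state explicitly. The function names are hypothetical, and the scaling-efficiency helper uses the common definition (speedup divided by device count), which may differ from the one used in the dissertation.

```python
def relative_overhead(time_s: float, baseline_s: float) -> float:
    """Fractional slowdown of a run relative to a baseline run."""
    return (time_s - baseline_s) / baseline_s


def time_from_overhead(baseline_s: float, overhead: float) -> float:
    """Training time implied by a baseline time and a fractional overhead."""
    return baseline_s * (1.0 + overhead)


def scaling_efficiency(time_1_device_s: float, time_n_devices_s: float,
                       n_devices: int) -> float:
    """Speedup divided by device count; 1.0 means perfect linear scaling."""
    speedup = time_1_device_s / time_n_devices_s
    return speedup / n_devices


# Reported in the abstract: 20 epochs with PyTorch DDP on 2 GPUs.
DDP_BASELINE_S = 499.0

# Assumption: both overheads are relative to the DDP run above.
# These implied times are derived here, not reported in the abstract.
determined_s = time_from_overhead(DDP_BASELINE_S, 0.21)
spark_s = time_from_overhead(DDP_BASELINE_S, 1.87)

print(f"Determined AI (implied): {determined_s:.0f} s")
print(f"Apache Spark (implied): {spark_s:.0f} s")
```

Under that assumption, the implied training times are roughly 604 seconds for Determined AI and 1432 seconds for Apache Spark; the dissertation itself should be consulted for the measured values.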
Related records
article Predicting model training time to optimize distributed machine learning applications
by: Guimarães, Miguel
Published: (2023)
school Distributed Learning of Convolutional Neural Networks on Heterogeneous Processing Units
by: Marques, José Fernando Duarte
Published: (2016)
article The impact of data selection strategies on distributed model performance
by: Guimarães, Miguel
Published: (2023)
article HaaS - a platform for password cracking in distributed heterogeneous systems
by: Lima, Carlos
Published: (2025)
article Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
by: Oliveira, Filipe
Published: (2024)
school Machine learning agents for computer games
by: Araújo, Miguel Diogo Ferraz
Published: (2021)
article Development of deep learning approaches to predict relationships between chemical structures and sweetness
by: Capela, João
Published: (2022)
school A Model for Scientific Workflows with Parallel and Distributed Computing
by: Assunção, Luís Manuel da Costa
Published: (2016)
article An automated and distributed machine learning framework for telecommunications risk management
by: Ferreira, Luís
Published: (2020)
article MUMPS based approach to parallelize the block cimmino algorithm
by: Balsa, Carlos
Published: (2008)
school Neural Networks, DeepFloat & TensorFlow Lite; Post-Training Quantization Case Study
by: Dias, Simão Pedro das Neves Gonçalves
Published: (2020)
school Dynamic Management of Distributed Machine Learning Problems
by: Oliveira, Filipe Vamonde
Published: (2023)
school Explorations of the semantic learning machine neuroevolution algorithm: dynamic training data use and ensemble construction methods
by: Seca, Marta Sofia Lopes
Published: (2020)
school Web machine learning services
by: Ferreira, João Paulo Ramos Carrasco
Published: (2021)
school Embedded real-time vision-based control and inspection of an industrial process
by: Marques, Miguel Filipe Gaspar
Published: (2023)
school Conservation of Marine Life with the help of AI and big data
by: Amorim, João Miguel Cunha
Published: (2019)
school Conillon: Distributed computing platform for desktop grids
by: Silva, Hélio Alexandre Dias da
Published: (2011)
article An hybrid approach for the parallelization of a block iterative algorithm
by: Balsa, Carlos
Published: (2010)
school Distributed deep learning for sleep apnea detection on ECG signals
by: Machado, Ana Margarida da Silva
Published: (2020)
article Wind turbines drive train fault detection: random forests vs CNNs
by: Daniel, Helder
Published: (2025)
article Data-intensive task scheduling in geo-distributed cloud computing
by: Liu, Zhaoze
Published: (2026)
school Development and Training of an Object Detection System
by: Pereira, João Margato Borlido
Published: (2025)
groups Emotion classification based on single electrode brain data: applications for assistive technology
by: Rodrigues, Duarte
Published: (2023)
article Detecting and monitoring the development stages of wild flowers and plants using computer vision: approaches, challenges and opportunities
by: Videira, João
Published: (2023)
article Computer vision in augmented, virtual, mixed and extended reality environments—a bibliometric review
by: Lopes, Júlio Castro
Published: (2024)
article Detecting and monitoring the development stages of wild flowers and plants using computer vision: Approaches, challenges and opportunities
by: Videira, João
Published: (2023)
school Ultrasound versus elastography in the study of hepatic steatosis
by: Marques, Rodrigo Ramos
Published: (2024)
school Utilização de ferramentas de machine learning no diagnóstico de patologias da laringe
by: Teixeira, Felipe
Published: (2019)
school A ecografia versus elastografia no estudo da sarcopenia
by: Lopes, Luís André Mendes
Published: (2023)
school Automated Trading System With Reinforcement Learning
by: Neves, José Luís Simões
Published: (2023)
school Detection of vehicles and buildings in drone aerial images
by: Amante, Rita Filipa dos Santos
Published: (2022)
article A Parallel and Distributed Framework for Constraint Solving
by: Pedro, Vasco
Published: (2012)
school Inteligência Artificial em imagem médica: amiga ou adversária?
by: Gomes, Carlos Alberto Marques
Published: (2020)
school Distributed Mail Transfer Agent
by: Santos, João Pedro de Sá Cardoso dos
Published: (2020)
school The Transformative Potential of Machine Learning in the Energy Distribution Sector
by: Santos, Beatriz Silva
Published: (2024)
school Deep reinforcement learning applied to an unrelated parallel machine scheduling problem: Deep Q-Network applied to a multi-objective unrelated parallel machine scheduling problem with sequence-dependent setup times, machine eligibility restrictions and a single common server
by: Neto, Celso Christiano Endres
Published: (2024)
article Integration of AI Use Cases in Training to Support Industry 4.0
by: Nazarenko, Artem A.
Published: (2024)
article An Image-Based Framework for Measuring the Prestress Level in CFRP Laminates: Experimental Validation
by: Valença, Jónatas
Published: (2023)
school Automated wound characterisation from images and multi-modal data
by: Curado, Tiago Moital
Published: (2024)
school Bicycles Mobility Prediction
by: Sousa, Tiago Nuno Barros
Published: (2022)