Publication

Distributed AI training platform

Bibliographic details
Abstract:	Training large-scale artificial intelligence models has become a critical challenge in modern research, requiring distributed infrastructures capable of efficiently coordinating multiple devices. This dissertation presents a comparative analysis of three distributed deep learning training platforms: PyTorch Distributed Data Parallel (DDP), Apache Spark, and Determined AI, evaluating their performance, resource management capabilities, and usability in organizational environments. The methodology involved implementing and testing each framework on a three-node cluster equipped with NVIDIA GPUs, using the BERT-tiny model for sentiment classification on the IMDB dataset. Quantitative metrics (training time, model accuracy, and scaling efficiency) were collected and complemented by a qualitative evaluation of configuration complexity, orchestration features, and developer experience. Results show that PyTorch DDP offers the best absolute performance, completing 20 epochs of training in 499 seconds with 2 GPUs, while Determined AI introduces a 21% overhead but provides superior cluster management capabilities, including automatic scheduling, experiment tracking, and fault tolerance. Apache Spark incurs significant overhead (187%) but integrates naturally into existing data processing pipelines. Framework selection depends on context: DDP is ideal for individual researchers prioritizing speed, Determined AI suits shared environments requiring reproducibility and centralized management, and Spark serves scenarios where training is integrated into broader big data workflows.
Main authors:	Cerqueiro, Tiago Andrés
Subject:	Distributed training; Deep learning; Machine learning; Parallel computing
Year:	2025
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Instituto Politécnico de Bragança
Language:	English
Source:	Biblioteca Digital do IPB

Related records

Predicting model training time to optimize distributed machine learning applications
by: Guimarães, Miguel
Published: (2023)

Distributed Learning of Convolutional Neural Networks on Heterogeneous Processing Units
by: Marques, José Fernando Duarte
Published: (2016)

The impact of data selection strategies on distributed model performance
by: Guimarães, Miguel
Published: (2023)

HaaS - a platform for password cracking in distributed heterogeneous systems
by: Lima, Carlos
Published: (2025)

Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
by: Oliveira, Filipe
Published: (2024)

Machine learning agents for computer games
by: Araújo, Miguel Diogo Ferraz
Published: (2021)

Development of deep learning approaches to predict relationships between chemical structures and sweetness
by: Capela, João
Published: (2022)

A Model for Scientific Workflows with Parallel and Distributed Computing
by: Assunção, Luís Manuel da Costa
Published: (2016)

An automated and distributed machine learning framework for telecommunications risk management
by: Ferreira, Luís
Published: (2020)

MUMPS based approach to parallelize the block cimmino algorithm
by: Balsa, Carlos
Published: (2008)

Neural Networks, DeepFloat & TensorFlow Lite; Post-Training Quantization Case Study
by: Dias, Simão Pedro das Neves Gonçalves
Published: (2020)

Dynamic Management of Distributed Machine Learning Problems
by: Oliveira, Filipe Vamonde
Published: (2023)

Explorations of the semantic learning machine neuroevolution algorithm: dynamic training data use and ensemble construction methods
by: Seca, Marta Sofia Lopes
Published: (2020)

Web machine learning services
by: Ferreira, João Paulo Ramos Carrasco
Published: (2021)

Embedded real-time vision-based control and inspection of an industrial process
by: Marques, Miguel Filipe Gaspar
Published: (2023)

Conservation of Marine Life with the help of AI and big data
by: Amorim, João Miguel Cunha
Published: (2019)

Conillon: Distributed computing platform for desktop grids
by: Silva, Hélio Alexandre Dias da
Published: (2011)

An hybrid approach for the parallelization of a block iterative algorithm
by: Balsa, Carlos
Published: (2010)

Distributed deep learning for sleep apnea detection on ECG signals
by: Machado, Ana Margarida da Silva
Published: (2020)

Wind turbines drive train fault detection: random forests vs CNNs
by: Daniel, Helder
Published: (2025)

Data-intensive task scheduling in geo-distributed cloud computing
by: Liu, Zhaoze
Published: (2026)

Development and Training of an Object Detection System
by: Pereira, João Margato Borlido
Published: (2025)

Emotion classification based on single electrode brain data: applications for assistive technology
by: Rodrigues, Duarte
Published: (2023)

Detecting and monitoring the development stages of wild flowers and plants using computer vision: approaches, challenges and opportunities
by: Videira, João
Published: (2023)

Computer vision in augmented, virtual, mixed and extended reality environments—a bibliometric review
by: Lopes, Júlio Castro
Published: (2024)

Ultrasound versus elastography in the study of hepatic steatosis
by: Marques, Rodrigo Ramos
Published: (2024)

Utilização de ferramentas de machine learning no diagnóstico de patologias da laringe
by: Teixeira, Felipe
Published: (2019)

A ecografia versus elastografia no estudo da sarcopenia
by: Lopes, Luís André Mendes
Published: (2023)

Automated Trading System With Reinforcement Learning
by: Neves, José Luís Simões
Published: (2023)

Detection of vehicles and buildings in drone aerial images
by: Amante, Rita Filipa dos Santos
Published: (2022)

A Parallel and Distributed Framework for Constraint Solving
by: Pedro, Vasco
Published: (2012)

Inteligência Artificial em imagem médica: amiga ou adversária?
by: Gomes, Carlos Alberto Marques
Published: (2020)

Distributed Mail Transfer Agent
by: Santos, João Pedro de Sá Cardoso dos
Published: (2020)

The Transformative Potential of Machine Learning in the Energy Distribution Sector
by: Santos, Beatriz Silva
Published: (2024)

Deep reinforcement learning applied to an unrelated parallel machine scheduling problem: Deep Q-Network applied to a multi-objective unrelated parallel machine scheduling problem with sequence-dependent setup times, machine eligibility restrictions and a single common server
by: Neto, Celso Christiano Endres
Published: (2024)

Integration of AI Use Cases in Training to Support Industry 4.0
by: Nazarenko, Artem A.
Published: (2024)

An Image-Based Framework for Measuring the Prestress Level in CFRP Laminates: Experimental Validation
by: Valença, Jónatas
Published: (2023)

Automated wound characterisation from images and multi-modal data
by: Curado, Tiago Moital
Published: (2024)

Bicycles Mobility Prediction
by: Sousa, Tiago Nuno Barros
Published: (2022)