Publication

Distributed AI training platform


Bibliographic details
Abstract: Training large-scale artificial intelligence models has become a critical challenge in modern research, requiring distributed infrastructures capable of efficiently coordinating multiple devices. This dissertation presents a comparative analysis of three distributed deep learning training platforms: PyTorch Distributed Data Parallel (DDP), Apache Spark, and Determined AI, evaluating their performance, resource management capabilities, and usability in organizational environments. The methodology involved implementing and testing each framework on a three-node cluster equipped with NVIDIA GPUs, using the BERT-tiny model for sentiment classification on the IMDB dataset. Quantitative metrics of training time, model accuracy, and scaling efficiency were collected, complemented by qualitative evaluation of configuration complexity, orchestration features, and developer experience. Results demonstrate that PyTorch DDP offers the best absolute performance, completing 20 epochs of training in 499 seconds with 2 GPUs, while Determined AI introduces a 21% overhead but provides superior cluster management capabilities, including automatic scheduling, experiment tracking, and fault tolerance. Apache Spark presents significant overhead (187%) but integrates naturally into existing data processing pipelines. Framework selection depends on context: DDP is ideal for individual researchers prioritizing speed, Determined AI suits shared environments requiring reproducibility and centralized management, and Spark serves scenarios where training is integrated into broader big data workflows.
Main authors: Cerqueiro, Tiago Andrés
Subject: Distributed training; Deep learning; Machine learning; Parallel computing
Year: 2025
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Instituto Politécnico de Bragança
Language: English
Source: Biblioteca Digital do IPB
_version_ 1867172978205130752
author Cerqueiro, Tiago Andrés
author_facet Cerqueiro, Tiago Andrés
author_role author
contributor_name_str_mv Lopes, Rui Pedro
Rufino, José
Biblioteca Digital do IPB
country_str PT
creators_json_txt [{\"Person.name\":\"Cerqueiro, Tiago Andrés\"}]
datacite.contributors.contributor.contributorName.fl_str_mv Lopes, Rui Pedro
Rufino, José
Biblioteca Digital do IPB
datacite.creators.creator.creatorName.fl_str_mv Cerqueiro, Tiago Andrés
datacite.date.Accepted.fl_str_mv 2025-01-01T00:00:00Z
datacite.date.available.fl_str_mv 2026-01-27T10:46:21Z
datacite.date.embargoed.fl_str_mv 2026-01-27T10:46:21Z
datacite.rights.fl_str_mv http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv Distributed training
Deep learning
Machine learning
Parallel computing
datacite.titles.title.fl_str_mv Distributed AI training platform
dc.contributor.none.fl_str_mv Lopes, Rui Pedro
Rufino, José
Biblioteca Digital do IPB
dc.creator.none.fl_str_mv Cerqueiro, Tiago Andrés
dc.date.Accepted.fl_str_mv 2025-01-01T00:00:00Z
dc.date.available.fl_str_mv 2026-01-27T10:46:21Z
dc.date.embargoed.fl_str_mv 2026-01-27T10:46:21Z
dc.format.none.fl_str_mv application/pdf
dc.identifier.none.fl_str_mv http://hdl.handle.net/10198/35626
dc.language.none.fl_str_mv eng
dc.rights.cclincense.fl_str_mv http://creativecommons.org/licenses/by/4.0/
dc.rights.none.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.subject.none.fl_str_mv Distributed training
Deep learning
Machine learning
Parallel computing
dc.title.fl_str_mv Distributed AI training platform
dc.type.none.fl_str_mv http://purl.org/coar/resource_type/c_bdcc
description Training large-scale artificial intelligence models has become a critical challenge in modern research, requiring distributed infrastructures capable of efficiently coordinating multiple devices. This dissertation presents a comparative analysis of three distributed deep learning training platforms: PyTorch Distributed Data Parallel (DDP), Apache Spark, and Determined AI, evaluating their performance, resource management capabilities, and usability in organizational environments. The methodology involved implementing and testing each framework on a three-node cluster equipped with NVIDIA GPUs, using the BERT-tiny model for sentiment classification on the IMDB dataset. Quantitative metrics of training time, model accuracy, and scaling efficiency were collected, complemented by qualitative evaluation of configuration complexity, orchestration features, and developer experience. Results demonstrate that PyTorch DDP offers the best absolute performance, completing 20 epochs of training in 499 seconds with 2 GPUs, while Determined AI introduces a 21% overhead but provides superior cluster management capabilities, including automatic scheduling, experiment tracking, and fault tolerance. Apache Spark presents significant overhead (187%) but integrates naturally into existing data processing pipelines. Framework selection depends on context: DDP is ideal for individual researchers prioritizing speed, Determined AI suits shared environments requiring reproducibility and centralized management, and Spark serves scenarios where training is integrated into broader big data workflows.
dirty 0
eu_rights_str_mv openAccess
format masterThesis
fulltext.url.fl_str_mv https://bibliotecadigital.ipb.pt/bitstreams/6eb12b14-4a01-4bac-8ec5-46097203ab7f/download
id ipb_d2a7aa12c8ad05295421aeeceffbdacf
identifier.url.fl_str_mv http://hdl.handle.net/10198/35626
instacron_str ipb
institution Instituto Politécnico de Bragança
instname_str Instituto Politécnico de Bragança
language eng
network_acronym_str ipb
network_name_str Biblioteca Digital do IPB
oai_identifier_str oai:bibliotecadigital.ipb.pt:10198/35626
organization_str_mv urn:organizationAcronym:ipb
person_str_mv Cerqueiro, Tiago Andrés
publishDate 2025
reponame_str Biblioteca Digital do IPB
repository_id_str urn:repositoryAcronym:ipb
service_str_mv urn:repositoryAcronym:ipb
spelling engporTraining large-scale artificial intelligence models has become a critical challenge in modern research, requiring distributed infrastructures capable of efficiently coordinating multiple devices. This dissertation presents a comparative analysis of three distributed deep learning training platforms: PyTorch Distributed Data Parallel (DDP), Apache Spark, and Determined AI, evaluating their performance, resource management capabilities, and usability in organizational environments. The methodology involved implementing and testing each framework on a three-node cluster equipped with NVIDIA GPUs, using the BERT-tiny model for sentiment classification on the IMDB dataset. Quantitative metrics of training time, model accuracy, and scaling efficiency were collected, complemented by qualitative evaluation of configuration complexity, orchestration features, and developer experience. Results demonstrate that PyTorch DDP offers the best absolute performance, completing 20 epochs of training in 499 seconds with 2 GPUs, while Determined AI introduces a 21% overhead but provides superior cluster management capabilities, including automatic scheduling, experiment tracking, and fault tolerance. Apache Spark presents significant overhead (187%) but integrates naturally into existing data processing pipelines.
Framework selection depends on context: DDP is ideal for individual researchers prioritizing speed, Determined AI suits shared environments requiring reproducibility and centralized management, and Spark serves scenarios where training is integrated into broader big data workflows.application/pdfDistributed AI training platformCerqueiro, Tiago AndrésLopes, Rui PedroRufino, JoséHostingInstitutionOrganizationalBiblioteca Digital do IPBe-mailmailto:dspace@ipb.ptdspace@ipb.ptURNurn:tid:2041626532026-01-27T10:46:21Z202520252025-01-01T00:00:00ZHandlehttp://hdl.handle.net/10198/35626http://purl.org/coar/access_right/c_abf2open accessDistributed trainingDeep learningMachine learningParallel computing2568984 bytesliteraturehttp://purl.org/coar/resource_type/c_bdccmaster thesis2025http://creativecommons.org/licenses/by/4.0/http://purl.org/coar/access_right/c_abf2application/pdffulltexthttps://bibliotecadigital.ipb.pt/bitstreams/6eb12b14-4a01-4bac-8ec5-46097203ab7f/download
spellingShingle Distributed AI training platform
Cerqueiro, Tiago Andrés
Distributed training
Deep learning
Machine learning
Parallel computing
status SINGLETON
subject.fl_str_mv Distributed training
Deep learning
Machine learning
Parallel computing
title Distributed AI training platform
title_full Distributed AI training platform
title_fullStr Distributed AI training platform
title_full_unstemmed Distributed AI training platform
title_short Distributed AI training platform
title_sort Distributed AI training platform
topic Distributed training
Deep learning
Machine learning
Parallel computing
topic_facet Distributed training
Deep learning
Machine learning
Parallel computing
url http://hdl.handle.net/10198/35626
visible 1