Publication

Distributed AI training platform

Bibliographic details
Abstract:	Training large-scale artificial intelligence models has become a critical challenge in modern research, requiring distributed infrastructures capable of efficiently coordinating multiple devices. This dissertation presents a comparative analysis of three distributed deep learning training platforms: PyTorch Distributed Data Parallel (DDP), Apache Spark, and Determined AI, evaluating their performance, resource management capabilities, and usability in organizational environments. The methodology involved implementing and testing each framework on a three-node cluster equipped with NVIDIA GPUs, using the BERT-tiny model for sentiment classification on the IMDB dataset. Quantitative metrics of training time, model accuracy, and scaling efficiency were collected, complemented by qualitative evaluation of configuration complexity, orchestration features, and developer experience. Results demonstrate that PyTorch DDP offers the best absolute performance, completing 20 epochs of training in 499 seconds with 2 GPUs, while Determined AI introduces a 21% overhead but provides superior cluster management capabilities, including automatic scheduling, experiment tracking, and fault tolerance. Apache Spark presents significant overhead (187%) but integrates naturally into existing data processing pipelines. Framework selection depends on context: DDP is ideal for individual researchers prioritizing speed, Determined AI suits shared environments requiring reproducibility and centralized management, and Spark serves scenarios where training is integrated into broader big data workflows.
Main authors:	Cerqueiro, Tiago Andrés
Subject:	Distributed training; Deep learning; Machine learning; Parallel computing
Year:	2025
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Instituto Politécnico de Bragança
Language:	English
Source:	Biblioteca Digital do IPB

_version_	1867172978205130752
author	Cerqueiro, Tiago Andrés
author_facet	Cerqueiro, Tiago Andrés
author_role	author
contributor_name_str_mv	Lopes, Rui Pedro; Rufino, José; Biblioteca Digital do IPB
country_str	PT
creators_json_txt	[{"Person.name":"Cerqueiro, Tiago Andrés"}]
datacite.contributors.contributor.contributorName.fl_str_mv	Lopes, Rui Pedro; Rufino, José; Biblioteca Digital do IPB
datacite.creators.creator.creatorName.fl_str_mv	Cerqueiro, Tiago Andrés
datacite.date.Accepted.fl_str_mv	2025-01-01T00:00:00Z
datacite.date.available.fl_str_mv	2026-01-27T10:46:21Z
datacite.date.embargoed.fl_str_mv	2026-01-27T10:46:21Z
datacite.rights.fl_str_mv	http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv	Distributed training; Deep learning; Machine learning; Parallel computing
datacite.titles.title.fl_str_mv	Distributed AI training platform
dc.contributor.none.fl_str_mv	Lopes, Rui Pedro; Rufino, José; Biblioteca Digital do IPB
dc.creator.none.fl_str_mv	Cerqueiro, Tiago Andrés
dc.date.Accepted.fl_str_mv	2025-01-01T00:00:00Z
dc.date.available.fl_str_mv	2026-01-27T10:46:21Z
dc.date.embargoed.fl_str_mv	2026-01-27T10:46:21Z
dc.format.none.fl_str_mv	application/pdf
dc.identifier.none.fl_str_mv	http://hdl.handle.net/10198/35626
dc.language.none.fl_str_mv	eng
dc.rights.cclincense.fl_str_mv	http://creativecommons.org/licenses/by/4.0/
dc.rights.none.fl_str_mv	http://purl.org/coar/access_right/c_abf2
dc.subject.none.fl_str_mv	Distributed training; Deep learning; Machine learning; Parallel computing
dc.title.fl_str_mv	Distributed AI training platform
dc.type.none.fl_str_mv	http://purl.org/coar/resource_type/c_bdcc
description	Training large-scale artificial intelligence models has become a critical challenge in modern research, requiring distributed infrastructures capable of efficiently coordinating multiple devices. This dissertation presents a comparative analysis of three distributed deep learning training platforms: PyTorch Distributed Data Parallel (DDP), Apache Spark, and Determined AI, evaluating their performance, resource management capabilities, and usability in organizational environments. The methodology involved implementing and testing each framework on a three-node cluster equipped with NVIDIA GPUs, using the BERT-tiny model for sentiment classification on the IMDB dataset. Quantitative metrics of training time, model accuracy, and scaling efficiency were collected, complemented by qualitative evaluation of configuration complexity, orchestration features, and developer experience. Results demonstrate that PyTorch DDP offers the best absolute performance, completing 20 epochs of training in 499 seconds with 2 GPUs, while Determined AI introduces a 21% overhead but provides superior cluster management capabilities, including automatic scheduling, experiment tracking, and fault tolerance. Apache Spark presents significant overhead (187%) but integrates naturally into existing data processing pipelines. Framework selection depends on context: DDP is ideal for individual researchers prioritizing speed, Determined AI suits shared environments requiring reproducibility and centralized management, and Spark serves scenarios where training is integrated into broader big data workflows.
dirty	0
eu_rights_str_mv	openAccess
format	masterThesis
fulltext.url.fl_str_mv	https://bibliotecadigital.ipb.pt/bitstreams/6eb12b14-4a01-4bac-8ec5-46097203ab7f/download
id	ipb_d2a7aa12c8ad05295421aeeceffbdacf
identifier.url.fl_str_mv	http://hdl.handle.net/10198/35626
instacron_str	ipb
institution	Instituto Politécnico de Bragança
instname_str	Instituto Politécnico de Bragança
language	eng
network_acronym_str	ipb
network_name_str	Biblioteca Digital do IPB
oai_identifier_str	oai:bibliotecadigital.ipb.pt:10198/35626
organization_str_mv	urn:organizationAcronym:ipb
person_str_mv	Cerqueiro, Tiago Andrés
publishDate	2025
reponame_str	Biblioteca Digital do IPB
repository_id_str	urn:repositoryAcronym:ipb
service_str_mv	urn:repositoryAcronym:ipb
spelling	eng; por; application/pdf; Distributed AI training platform; Cerqueiro, Tiago Andrés; Lopes, Rui Pedro; Rufino, José; HostingInstitution; Organizational; Biblioteca Digital do IPB; e-mail; mailto:dspace@ipb.pt; dspace@ipb.pt; URN; urn:tid:204162653; 2026-01-27T10:46:21Z; 2025; 2025; 2025-01-01T00:00:00Z; Handle; http://hdl.handle.net/10198/35626; http://purl.org/coar/access_right/c_abf2; open access; Distributed training; Deep learning; Machine learning; Parallel computing; 2568984 bytes; literature; http://purl.org/coar/resource_type/c_bdcc; master thesis; 2025; http://creativecommons.org/licenses/by/4.0/; http://purl.org/coar/access_right/c_abf2; application/pdf; fulltext; https://bibliotecadigital.ipb.pt/bitstreams/6eb12b14-4a01-4bac-8ec5-46097203ab7f/download
spellingShingle	Distributed AI training platform; Cerqueiro, Tiago Andrés; Distributed training; Deep learning; Machine learning; Parallel computing
status	SINGLETON
subject.fl_str_mv	Distributed training; Deep learning; Machine learning; Parallel computing
title	Distributed AI training platform
title_full	Distributed AI training platform
title_fullStr	Distributed AI training platform
title_full_unstemmed	Distributed AI training platform
title_short	Distributed AI training platform
title_sort	Distributed AI training platform
topic	Distributed training; Deep learning; Machine learning; Parallel computing
topic_facet	Distributed training; Deep learning; Machine learning; Parallel computing
url	http://hdl.handle.net/10198/35626
visible	1
