Publication

A generalized pipeline infraestructure for developing multiple ML algorithms

Bibliographic details
Abstract:	Machine Learning (ML) development is a very experimental, repetitive, and error-prone task, because knowing beforehand which model works best for our goals is very hard, so practitioners have an incentive to experiment with as many models, approaches, and techniques as they can. Going from raw data to a well-adjusted model is not a trivial process, often requiring complex, multi-step pipelines. Additionally, one of the areas where these pipelines arguably change the most is the pre-processing of raw data. This dissertation aims to simplify the work of ML engineers by providing a single platform where they can process data and run models, fostering efficiency and ease of use. The focus, therefore, is on developing a generalized ML pipeline that serves as a versatile solution accessible to a wide range of engineers, streamlining their processes and contributing to a more cohesive and efficient workflow. To assess the viability of this objective, we implement the Data Management phase of an ML lifecycle. This involves a case study that uses two Machine Learning Operations (MLOps) solutions: one utilizing an event-driven approach, and the other adopting a monolithic structure, both handling a shared sensor data type. By integrating segments of both solutions, specifically their Extract, Transform, Load (ETL) pipelines, two architectural proposals emerge. One follows the Attribute-Driven Design (ADD) process, which is a methodology that prioritizes and guides software design decisions based on critical system attributes, ensuring effective fulfillment of quality requirements. The second, a non-procedural approach orchestrated with tools like Argo Workflows, takes a significantly different path, and was the one approved by the stakeholders. Crucially, findings reveal that constructing an infrastructure supporting a generalized ML pipeline is indeed feasible.
However, the shift from a robust event-driven architecture to an orchestrator-based one, like Argo Workflows, comes at a cost. There are notable trade-offs in terms of both expense and performance due to the reduced availability of workflows. The dissertation concludes by recommending a course of action based on these valuable insights, acknowledging the dynamic nature of the AI development landscape.
Main authors:	Vieira, Catarina Pais
Subject:	Machine Learning; Software Architecture; Data management; Argo workflows
Year:	2024
Country:	Portugal
Document type:	master's thesis
Access type:	open access
Associated institution:	Universidade do Minho
Language:	English
Source:	RepositóriUM - Universidade do Minho

_version_	1866876004546379776
author	Vieira, Catarina Pais
author_facet	Vieira, Catarina Pais
author_role	author
contributor_name_str_mv	Ferreira, André Leite; Fernandes, João M.; Universidade do Minho
country_str	PT
creators_json_txt	[{\"Person.name\":\"Vieira, Catarina Pais\"}]
datacite.contributors.contributor.contributorName.fl_str_mv	Ferreira, André Leite; Fernandes, João M.; Universidade do Minho
datacite.creators.creator.creatorName.fl_str_mv	Vieira, Catarina Pais
datacite.date.Accepted.fl_str_mv	2024-04-09T00:00:00Z
datacite.date.available.fl_str_mv	2024-10-05T15:57:06Z
datacite.date.embargoed.fl_str_mv	2024-10-05T15:57:06Z
datacite.rights.fl_str_mv	http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv	Machine Learning; Software Architecture; Data management; Argo workflows
datacite.titles.title.fl_str_mv	A generalized pipeline infraestructure for developing multiple ML algorithms; Uma infraestrutura de uma pipeline generalizada para o desenvolvimento de múltiplos algoritmos de ML
dc.contributor.none.fl_str_mv	Ferreira, André Leite; Fernandes, João M.; Universidade do Minho
dc.creator.none.fl_str_mv	Vieira, Catarina Pais
dc.date.Accepted.fl_str_mv	2024-04-09T00:00:00Z
dc.date.available.fl_str_mv	2024-10-05T15:57:06Z
dc.date.embargoed.fl_str_mv	2024-10-05T15:57:06Z
dc.format.none.fl_str_mv	application/pdf
dc.identifier.none.fl_str_mv	https://hdl.handle.net/1822/93200
dc.language.none.fl_str_mv	eng
dc.rights.cclincense.fl_str_mv	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.none.fl_str_mv	http://purl.org/coar/access_right/c_abf2
dc.rights.rights.copyright.fl_str_mv	openAccess
dc.subject.none.fl_str_mv	Machine Learning; Software Architecture; Data management; Argo workflows
dc.title.fl_str_mv	A generalized pipeline infraestructure for developing multiple ML algorithms; Uma infraestrutura de uma pipeline generalizada para o desenvolvimento de múltiplos algoritmos de ML
dc.type.none.fl_str_mv	http://purl.org/coar/resource_type/c_bdcc
dirty	0
eu_rights_str_mv	openAccess
format	masterThesis
fulltext.url.fl_str_mv	https://prod-dspace.uminho.pt/bitstreams/7cf8ddda-e0fd-49cc-b6c9-3be09db97055/download
id	rum_717cc489dfa92faa4c8bf6904e3fcd9d
identifier.url.fl_str_mv	https://hdl.handle.net/1822/93200
instacron_str	repositorium
institution	Universidade do Minho
instname_str	Universidade do Minho
language	eng
network_acronym_str	rum
network_name_str	RepositóriUM - Universidade do Minho
oai_identifier_str	oai:repositorium.uminho.pt:1822/93200
organization_str_mv	urn:organizationAcronym:repositorium
person_str_mv	Vieira, Catarina Pais
publishDate	2024
reponame_str	RepositóriUM - Universidade do Minho
repository_id_str	urn:repositoryAcronym:rum
service_str_mv	urn:repositoryAcronym:rum
spellingShingle	A generalized pipeline infraestructure for developing multiple ML algorithms Vieira, Catarina Pais Machine Learning Software Architecture Data management Argo workflows
status	SINGLETON
subject.fl_str_mv	Machine Learning; Software Architecture; Data management; Argo workflows
title	A generalized pipeline infraestructure for developing multiple ML algorithms
title_full	A generalized pipeline infraestructure for developing multiple ML algorithms
title_fullStr	A generalized pipeline infraestructure for developing multiple ML algorithms
title_full_unstemmed	A generalized pipeline infraestructure for developing multiple ML algorithms
title_short	A generalized pipeline infraestructure for developing multiple ML algorithms
title_sort	A generalized pipeline infraestructure for developing multiple ML algorithms
topic	Machine Learning; Software Architecture; Data management; Argo workflows
topic_facet	Machine Learning; Software Architecture; Data management; Argo workflows
url	https://hdl.handle.net/1822/93200
visible	1
