Publication

A generalized pipeline infraestructure for developing multiple ML algorithms

Bibliographic details
Abstract: Machine Learning (ML) development is a highly experimental, repetitive, and error-prone task. Because it is very hard to know beforehand which model works best for a given goal, practitioners have an incentive to experiment with as many models, approaches, and techniques as they can. Going from raw data to a well-adjusted model is not a trivial process and often requires complex, multi-step pipelines. Additionally, one of the areas where these pipelines arguably change the most is the pre-processing of raw data. This dissertation aims to simplify the work of ML engineers by providing a single platform where they can process data and run models, fostering efficiency and ease of use. The focus, therefore, is on developing a generalized ML pipeline that serves as a versatile solution accessible to a wide range of engineers, streamlining their processes and contributing to a more cohesive and efficient workflow. To assess the viability of this objective, we implement the Data Management phase of an ML lifecycle. This involves a case study that uses two Machine Learning Operations (MLOps) solutions: one using an event-driven approach and the other adopting a monolithic structure, both handling a shared sensor data type. By integrating segments of both solutions, specifically their Extract, Transform, Load (ETL) pipelines, two architectural proposals emerge. The first follows the Attribute-Driven Design (ADD) process, a methodology that prioritizes and guides software design decisions based on critical system attributes so that quality requirements are met. The second, a non-procedural approach orchestrated with tools like Argo Workflows, takes a significantly different path and was the one approved by the stakeholders. Crucially, the findings reveal that constructing an infrastructure supporting a generalized ML pipeline is indeed feasible.
However, the shift from a robust event-driven architecture to an orchestrator-based one, like Argo Workflows, comes at a cost. There are notable trade-offs in terms of both expense and performance due to the reduced availability of workflows. The dissertation concludes by recommending a course of action based on these insights, acknowledging the dynamic nature of the AI development landscape.
Main authors: Vieira, Catarina Pais
Subject: Machine Learning; Software Architecture; Data management; Argo workflows
Year: 2024
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade do Minho
Language: English
Source: RepositóriUM - Universidade do Minho
_version_ 1866876004546379776
author Vieira, Catarina Pais
author_facet Vieira, Catarina Pais
author_role author
contributor_name_str_mv Ferreira, André Leite
Fernandes, João M.
Universidade do Minho
country_str PT
creators_json_txt [{"Person.name":"Vieira, Catarina Pais"}]
datacite.contributors.contributor.contributorName.fl_str_mv Ferreira, André Leite
Fernandes, João M.
Universidade do Minho
datacite.creators.creator.creatorName.fl_str_mv Vieira, Catarina Pais
datacite.date.Accepted.fl_str_mv 2024-04-09T00:00:00Z
datacite.date.available.fl_str_mv 2024-10-05T15:57:06Z
datacite.date.embargoed.fl_str_mv 2024-10-05T15:57:06Z
datacite.rights.fl_str_mv http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv Machine Learning
Software Architecture
Data management
Argo workflows
datacite.titles.title.fl_str_mv A generalized pipeline infraestructure for developing multiple ML algorithms
Uma infraestrutura de uma pipeline generalizada para o desenvolvimento de múltiplos algoritmos de ML
dc.contributor.none.fl_str_mv Ferreira, André Leite
Fernandes, João M.
Universidade do Minho
dc.creator.none.fl_str_mv Vieira, Catarina Pais
dc.date.Accepted.fl_str_mv 2024-04-09T00:00:00Z
dc.date.available.fl_str_mv 2024-10-05T15:57:06Z
dc.date.embargoed.fl_str_mv 2024-10-05T15:57:06Z
dc.format.none.fl_str_mv application/pdf
dc.identifier.none.fl_str_mv https://hdl.handle.net/1822/93200
dc.language.none.fl_str_mv eng
dc.rights.cclincense.fl_str_mv http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.none.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.rights.copyright.fl_str_mv openAccess
dc.subject.none.fl_str_mv Machine Learning
Software Architecture
Data management
Argo workflows
dc.title.fl_str_mv A generalized pipeline infraestructure for developing multiple ML algorithms
Uma infraestrutura de uma pipeline generalizada para o desenvolvimento de múltiplos algoritmos de ML
dc.type.none.fl_str_mv http://purl.org/coar/resource_type/c_bdcc
dirty 0
eu_rights_str_mv openAccess
format masterThesis
fulltext.url.fl_str_mv https://prod-dspace.uminho.pt/bitstreams/7cf8ddda-e0fd-49cc-b6c9-3be09db97055/download
id rum_717cc489dfa92faa4c8bf6904e3fcd9d
identifier.url.fl_str_mv https://hdl.handle.net/1822/93200
instacron_str repositorium
institution Universidade do Minho
instname_str Universidade do Minho
language eng
network_acronym_str rum
network_name_str RepositóriUM - Universidade do Minho
oai_identifier_str oai:repositorium.uminho.pt:1822/93200
organization_str_mv urn:organizationAcronym:repositorium
person_str_mv Vieira, Catarina Pais
publishDate 2024
reponame_str RepositóriUM - Universidade do Minho
repository_id_str urn:repositoryAcronym:rum
service_str_mv urn:repositoryAcronym:rum
spelling urn:tid:203668766; 15081531 bytes; mailto:repositorium@usdb.uminho.pt
status SINGLETON
subject.fl_str_mv Machine Learning
Software Architecture
Data management
Argo workflows
title A generalized pipeline infraestructure for developing multiple ML algorithms
topic Machine Learning
Software Architecture
Data management
Argo workflows
url https://hdl.handle.net/1822/93200
visible 1