Publication

A generalized pipeline infrastructure for developing multiple ML algorithms

Bibliographic details
Abstract:	Machine Learning (ML) development is a very experimental, repetitive, and error-prone task. Because it is very hard to know beforehand which model works best for a given goal, practitioners have an incentive to experiment with as many models, approaches, and techniques as they can. Going from raw data to a well-adjusted model is not a trivial process, often requiring complex, multi-step pipelines. Additionally, one of the areas where these pipelines arguably change the most is the pre-processing of raw data. This dissertation aims to simplify the work of ML engineers by providing a single platform where they can process data and run models, fostering efficiency and ease of use. The focus, therefore, is on developing a generalized ML pipeline that serves as a versatile solution accessible to a wide range of engineers, streamlining their processes and contributing to a more cohesive and efficient workflow. To assess the viability of this objective, we implement the Data Management phase of an ML lifecycle. This involves a case study that uses two Machine Learning Operations (MLOps) solutions: one utilizing an event-driven approach, and the other adopting a monolithic structure, both handling a shared sensor data type. By integrating segments of both solutions, specifically their Extract, Transform, Load (ETL) pipelines, two architectural proposals emerge. The first follows the Attribute-Driven Design (ADD) process, a methodology that prioritizes and guides software design decisions based on critical system attributes, ensuring effective fulfillment of quality requirements. The second, a non-procedural approach orchestrated with tools like Argo Workflows, takes a significantly different path, and was the one approved by the stakeholders. Crucially, findings reveal that constructing an infrastructure supporting a generalized ML pipeline is indeed feasible.
However, the shift from a robust event-driven architecture to an orchestrator-based one, like Argo Workflows, comes at a cost. There are notable trade-offs in terms of both expense and performance due to the reduced availability of workflows. The dissertation concludes by recommending a course of action based on these insights, acknowledging the dynamic nature of the AI development landscape.
Main authors:	Vieira, Catarina Pais
Subject:	Machine Learning; Software Architecture; Data management; Argo Workflows
Year:	2024
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade do Minho
Language:	English
Source:	RepositóriUM - Universidade do Minho
