Publication
A generalized pipeline infraestructure for developing multiple ML algorithms
| Abstract: | Machine Learning (ML) development is a very experimental, repetitive, and error-prone task, because knowing what model works best for our goals beforehand is very hard, so practitioners have an incentive to experiment with as many models, approaches, and techniques as they can. Going from raw data to a well-adjusted model is not a trivial process, often requiring complex, multi-step pipelines. Additionally, one of the areas where these pipelines arguably change the most is the pre-processing of raw data. This dissertation aims to simplify the work of ML engineers by providing a single platform where they can process data and run models, fostering efficiency and ease of use. The focus, therefore, is on developing a generalized ML pipeline that serves as a versatile solution accessible to a wide range of engineers, streamlining their processes and contributing to a more cohesive and efficient workflow. To assess the viability of this objective, we implement the Data Management phase of an ML lifecycle. This involves a case study that uses two Machine Learning Operations (MLOps) solutions: one utilizing an event-driven approach, and the other adopting a monolithic structure, both handling a shared sensor data type. By integrating segments of both solutions, specifically their Extract, Transform, Load (ETL) pipelines, two architectural proposals emerge. One follows the Attribute-Driven Design (ADD) process, which is a methodology that prioritizes and guides software design decisions based on critical system attributes, ensuring effective fulfillment of quality requirements. The second, a non-procedural approach orchestrated with tools like Argo Workflows, takes a significantly different path and was the one approved by the stakeholders. Crucially, findings reveal that constructing an infrastructure supporting a generalized ML pipeline is indeed feasible. However, the shift from a robust event-driven architecture to an orchestrator-based one, like Argo Workflows, comes at a cost. There are notable trade-offs in terms of both expense and performance due to the reduced availability of workflows. The dissertation concludes by recommending a course of action based on these valuable insights, acknowledging the dynamic nature of the AI development landscape. |
|---|---|
| Main authors: | Vieira, Catarina Pais |
| Subject: | Machine Learning; Software Architecture; Data management; Argo workflows |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade do Minho |
| Language: | English |
| Source: | RepositóriUM - Universidade do Minho |
| _version_ | 1866876004546379776 |
|---|---|
| author | Vieira, Catarina Pais |
| author_facet | Vieira, Catarina Pais |
| author_role | author |
| contributor_name_str_mv | Ferreira, André Leite; Fernandes, João M.; Universidade do Minho |
| country_str | PT |
| creators_json_txt | [{"Person.name":"Vieira, Catarina Pais"}] |
| datacite.contributors.contributor.contributorName.fl_str_mv | Ferreira, André Leite; Fernandes, João M.; Universidade do Minho |
| datacite.creators.creator.creatorName.fl_str_mv | Vieira, Catarina Pais |
| datacite.date.Accepted.fl_str_mv | 2024-04-09T00:00:00Z |
| datacite.date.available.fl_str_mv | 2024-10-05T15:57:06Z |
| datacite.date.embargoed.fl_str_mv | 2024-10-05T15:57:06Z |
| datacite.rights.fl_str_mv | http://purl.org/coar/access_right/c_abf2 |
| datacite.subjects.subject.fl_str_mv | Machine Learning; Software Architecture; Data management; Argo workflows |
| datacite.titles.title.fl_str_mv | A generalized pipeline infraestructure for developing multiple ML algorithms; Uma infraestrutura de uma pipeline generalizada para o desenvolvimento de múltiplos algoritmos de ML |
| dc.contributor.none.fl_str_mv | Ferreira, André Leite; Fernandes, João M.; Universidade do Minho |
| dc.creator.none.fl_str_mv | Vieira, Catarina Pais |
| dc.date.Accepted.fl_str_mv | 2024-04-09T00:00:00Z |
| dc.date.available.fl_str_mv | 2024-10-05T15:57:06Z |
| dc.date.embargoed.fl_str_mv | 2024-10-05T15:57:06Z |
| dc.format.none.fl_str_mv | application/pdf |
| dc.identifier.none.fl_str_mv | https://hdl.handle.net/1822/93200 |
| dc.language.none.fl_str_mv | eng |
| dc.rights.cclincense.fl_str_mv | http://creativecommons.org/licenses/by-nc-sa/4.0/ |
| dc.rights.none.fl_str_mv | http://purl.org/coar/access_right/c_abf2 |
| dc.rights.rights.copyright.fl_str_mv | openAccess |
| dc.subject.none.fl_str_mv | Machine Learning; Software Architecture; Data management; Argo workflows |
| dc.title.fl_str_mv | A generalized pipeline infraestructure for developing multiple ML algorithms; Uma infraestrutura de uma pipeline generalizada para o desenvolvimento de múltiplos algoritmos de ML |
| dc.type.none.fl_str_mv | http://purl.org/coar/resource_type/c_bdcc |
| dirty | 0 |
| eu_rights_str_mv | openAccess |
| format | masterThesis |
| fulltext.url.fl_str_mv | https://prod-dspace.uminho.pt/bitstreams/7cf8ddda-e0fd-49cc-b6c9-3be09db97055/download |
| id | rum_717cc489dfa92faa4c8bf6904e3fcd9d |
| identifier.url.fl_str_mv | https://hdl.handle.net/1822/93200 |
| instacron_str | repositorium |
| institution | Universidade do Minho |
| instname_str | Universidade do Minho |
| language | eng |
| network_acronym_str | rum |
| network_name_str | RepositóriUM - Universidade do Minho |
| oai_identifier_str | oai:repositorium.uminho.pt:1822/93200 |
| organization_str_mv | urn:organizationAcronym:repositorium |
| person_str_mv | Vieira, Catarina Pais |
| publishDate | 2024 |
| reponame_str | RepositóriUM - Universidade do Minho |
| repository_id_str | urn:repositoryAcronym:rum |
| service_str_mv | urn:repositoryAcronym:rum |
| spelling | eng; por; application/pdf; por; A generalized pipeline infraestructure for developing multiple ML algorithms; AlternativeTitle; por; Uma infraestrutura de uma pipeline generalizada para o desenvolvimento de múltiplos algoritmos de ML; Vieira, Catarina Pais; Ferreira, André Leite; Fernandes, João M.; HostingInstitution; Organizational; Universidade do Minho; e-mail; mailto:repositorium@usdb.uminho.pt; repositorium@usdb.uminho.pt; URN; urn:tid:203668766; 2024-10-05T15:57:06Z; 2024-04-09; 2024-01; 2024-04-09T00:00:00Z; Handle; https://hdl.handle.net/1822/93200; http://purl.org/coar/access_right/c_abf2; open access; Machine Learning; Software Architecture; Data management; Argo workflows; 15081531 bytes; literature; http://purl.org/coar/resource_type/c_bdcc; master thesis; 2024-04-09; http://creativecommons.org/licenses/by-nc-sa/4.0/; openAccess; http://purl.org/coar/access_right/c_abf2; application/pdf; fulltext; https://prod-dspace.uminho.pt/bitstreams/7cf8ddda-e0fd-49cc-b6c9-3be09db97055/download |
| spellingShingle | A generalized pipeline infraestructure for developing multiple ML algorithms; Vieira, Catarina Pais; Machine Learning; Software Architecture; Data management; Argo workflows |
| status | SINGLETON |
| subject.fl_str_mv | Machine Learning; Software Architecture; Data management; Argo workflows |
| title | A generalized pipeline infraestructure for developing multiple ML algorithms |
| topic | Machine Learning; Software Architecture; Data management; Argo workflows |
| topic_facet | Machine Learning; Software Architecture; Data management; Argo workflows |
| url | https://hdl.handle.net/1822/93200 |
| visible | 1 |