Publication

Improving Machine Learning Pipeline Creation using Visual Programming and Static Analysis

Bibliographic details
Abstract: ML pipelines are composed of several steps that load data, clean it, process it, apply learning algorithms, and either produce reports or deploy inference systems into production. In real-world scenarios, pipelines can take days, weeks, or months to train on large quantities of data. Unfortunately, current tools for designing and orchestrating ML pipelines are oblivious to the semantics of each step, which lets developers easily introduce errors by connecting two components that might not work together, either syntactically or semantically. Data scientists and engineers often find these bugs only during or after the lengthy execution, which decreases their productivity. We propose a Visual Programming Language (VPL) enriched with semantic constraints on the behavior of each component, together with a verification methodology that checks entire pipelines to detect common ML bugs that existing visual and textual programming languages do not catch. We evaluate this methodology on a set of six bugs taken from a data science company focused on preventing financial fraud on big data. We were able to detect these data engineering and data balancing bugs, as well as unnecessary computation in the pipelines.
Main authors: David, João Pedro Vieira
Subject: Visual Programming; Machine Learning; Pipeline; Type Checking; Compiler; Master's theses - 2021
Year: 2021
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade de Lisboa
Language: English
Source: Repositório da Universidade de Lisboa
_version_ 1866811393929379840
author David, João Pedro Vieira
author_facet David, João Pedro Vieira
author_role author
contributor_name_str_mv Fonseca, Alcides Miguel Cachulo Aguiar
Repositório Científico de Acesso Aberto da ULisboa
country_str PT
creators_json_txt [{\"Person.name\":\"David, João Pedro Vieira\"}]
datacite.contributors.contributor.contributorName.fl_str_mv Fonseca, Alcides Miguel Cachulo Aguiar
Repositório Científico de Acesso Aberto da ULisboa
datacite.creators.creator.creatorName.fl_str_mv David, João Pedro Vieira
datacite.date.Accepted.fl_str_mv 2021-01-01T00:00:00Z
datacite.date.available.fl_str_mv 2022-03-25T13:48:33Z
datacite.date.embargoed.fl_str_mv 2022-03-25T13:48:33Z
datacite.rights.fl_str_mv http://purl.org/coar/access_right/c_abf2
datacite.subjects.subject.fl_str_mv Programação Visual
Aprendizagem Automática
Pipeline
Verificação de Tipos
Compilador
Teses de mestrado - 2021
datacite.titles.title.fl_str_mv Improving Machine Learning Pipeline Creation using Visual Programming and Static Analysis
dc.contributor.none.fl_str_mv Fonseca, Alcides Miguel Cachulo Aguiar
Repositório Científico de Acesso Aberto da ULisboa
dc.creator.none.fl_str_mv David, João Pedro Vieira
dc.date.Accepted.fl_str_mv 2021-01-01T00:00:00Z
dc.date.available.fl_str_mv 2022-03-25T13:48:33Z
dc.date.embargoed.fl_str_mv 2022-03-25T13:48:33Z
dc.format.none.fl_str_mv application/pdf
dc.identifier.none.fl_str_mv http://hdl.handle.net/10451/51973
dc.language.none.fl_str_mv eng
dc.rights.none.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.subject.none.fl_str_mv Programação Visual
Aprendizagem Automática
Pipeline
Verificação de Tipos
Compilador
Teses de mestrado - 2021
dc.title.fl_str_mv Improving Machine Learning Pipeline Creation using Visual Programming and Static Analysis
dc.type.none.fl_str_mv http://purl.org/coar/resource_type/c_bdcc
dirty 0
eu_rights_str_mv openAccess
format masterThesis
fulltext.url.fl_str_mv https://repositorio.ulisboa.pt/bitstreams/7346f002-c5d7-4e20-af67-b97423763193/download
id ul_de819dc894a365dfbc5d2f541c6fef8a
identifier.url.fl_str_mv http://hdl.handle.net/10451/51973
instacron_str ul
institution Universidade de Lisboa
instname_str Universidade de Lisboa
language eng
network_acronym_str ul
network_name_str Repositório da Universidade de Lisboa
oai_identifier_str oai:repositorio.ulisboa.pt:10451/51973
organization_str_mv urn:organizationAcronym:ul
person_str_mv David, João Pedro Vieira
publishDate 2021
reponame_str Repositório da Universidade de Lisboa
repository_id_str urn:repositoryAcronym:ul
service_str_mv urn:repositoryAcronym:ul
status SINGLETON
subject.fl_str_mv Programação Visual
Aprendizagem Automática
Pipeline
Verificação de Tipos
Compilador
Teses de mestrado - 2021
title Improving Machine Learning Pipeline Creation using Visual Programming and Static Analysis
topic Programação Visual
Aprendizagem Automática
Pipeline
Verificação de Tipos
Compilador
Teses de mestrado - 2021
url http://hdl.handle.net/10451/51973
visible 1