Publication

Automatic task discovery: towards full automation of the machine learning lifecycle

Bibliographic details
Abstract: Our world generates a vast amount of data daily, leading to a steady increase in the demand for skilled data scientists, a demand that employers struggle to meet. In response, the field of Automated Machine Learning (AutoML) has emerged, focusing on automating the stages of the Machine Learning (ML) lifecycle. While many stages of the ML lifecycle have already been fully automated, limited research has been directed toward automating the task framing stage. This work introduces the first data-driven approach for Automatic Task Discovery (ATD) of supervised ML tasks. It proposes the ML task frame representation language MATA, which is used to label a data corpus of 256 datasets with ML task frames. ATD is formulated as a multi-class classification task, for which the column-type annotation frameworks TURL [24] and Sato [102] are fine-tuned and compared. TURL outperforms Sato and is further enhanced through data augmentation and hyperparameter optimization. The resulting model achieves an F1-score of 0.847 on the test set. Additionally, a qualitative analysis of the generated task frames demonstrates the model's ability to discover meaningful ML task frames. The result is a novel data-driven approach for the automatic discovery of supervised ML tasks, ready for integration into existing AutoML frameworks.
Main authors: Gehmayr, Jonathan
Subject: Automated machine learning; Machine learning; Natural language processing; Column type annotation; Transformer; Master's theses - 2024
Year: 2024
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade de Lisboa
Language: English
Source: Repositório da Universidade de Lisboa