Publication
Automatic task discovery: towards full automation of the machine learning lifecycle
| Abstract: | Our world generates a vast amount of data daily, leading to a steady increase in the demand for skilled data scientists. However, employers struggle to meet this demand. Thus, the field of Automated Machine Learning (AutoML) has emerged, focusing on automating the stages of the Machine Learning (ML) lifecycle. While many stages of the ML lifecycle have already been fully automated, limited research has been directed toward automating the stage of task framing. This work introduces the first data-driven approach for Automatic Task Discovery (ATD) of supervised ML tasks. The ML task frame representation language MATA is proposed, which is used to label a data corpus comprising 256 datasets with ML task frames. ATD is formulated as a multi-class classification task, for which the column-type annotation frameworks TURL [24] and Sato [102] are fine-tuned and compared. TURL achieves superior performance when compared to Sato and is further enhanced through data augmentation and hyperparameter optimization. The resulting model achieves an impressive F1-score of 0.847 on the test set. Additionally, a qualitative analysis of the generated task frames demonstrates the model’s ability to discover meaningful ML task frames. With this work, a novel data-driven approach for automatic discovery of supervised ML tasks is presented, ready for integration into existing AutoML frameworks. |
|---|---|
| Main authors: | Gehmayr, Jonathan |
| Subject: | Automated machine learning; Machine learning; Natural language processing; Column type annotation; Transformer; Master's theses - 2024 |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade de Lisboa |
| Language: | English |
| Source: | Repositório da Universidade de Lisboa |