Publication
Improving Multiclass Classification for Software Issue Triage: A Comparative Study of Embedding Methods and Class Imbalance Strategies
| Abstract: | As software systems evolve rapidly and the volume of issue reports continues to grow, effective issue triage has become more critical than ever. Manual triage places a significant burden on IT technicians and developers, often leading to human errors, increased costs, and time inefficiencies. To address these challenges, many researchers have explored automated triage systems. However, the performance of such systems varies greatly depending on the methodologies employed, particularly the choice of text embedding techniques, machine learning algorithms, and class rebalancing methods. With recent advancements in Natural Language Processing (NLP) and the emergence of powerful Large Language Models (LLMs), there is a growing need to investigate whether these technologies can enhance automated triage systems. This study aims to bridge that gap by evaluating various text embedding methods (Word2Vec, ELMo, BERT, ModernBERT, and OpenAI’s embeddings) and three machine learning classifiers (LinearSVM, Random Forest, and Gradient Boosting). Given that real-world issue reports often suffer from class imbalance, we also applied two rebalancing techniques: SMOTE and LLM-based text augmentation. Our results show that OpenAI’s latest embedding model, text-embedding-3-small, combined with the LinearSVM classifier, achieved the highest F1 score. In terms of class rebalancing, SMOTE remains a reliable method, though LLM-based text augmentation also showed promise, depending on the dataset. Despite these encouraging findings, further research is required to explore more advanced embedding techniques, alternative classifiers, and improved augmentation strategies. Notably, the application of text augmentation in issue triage is still in its early stages, meaning that continued research is necessary to explore its potential. |
|---|---|
| Main authors: | Park, Hyeonsuk |
| Subject: | Software issue triage; Bug Triage; Text Classification; Text Embedding; Natural Language Processing; Class Imbalance; Machine Learning |
| Year: | 2025 |
| Country: | Portugal |
| Document type: | master's thesis |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |