Publication

Automating news classification with large language models: Exploring fine-tuning, dataset size, and architecture

Bibliographic details
Abstract:	This work benchmarks Large Language Models against traditional machine learning algorithms for automatic news classification on three standard datasets: BBC News, 20 Newsgroups, and AG News. We implement traditional models, including Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest, to provide clear and interpretable baselines using manually engineered term-frequency and syntactic features. We then fine-tune transformer architectures, including BERT, RoBERTa, T5, GPT, and their distilled variants, to quantify improvements in predictive accuracy, resource efficiency, and explainability. Performance is measured via 5-fold cross-validation using F1 and accuracy metrics, and statistical significance is assessed with a Friedman test followed by Holm's correction. Results show that transformer models consistently outperform classical approaches, with BERT achieving the highest scores under both balanced and imbalanced conditions. Distilled models rival or surpass full-size transformers on larger datasets while reducing memory requirements and maintaining comparable inference latency. Attention-based attribution methods provide semantic explanations on par with feature-importance metrics, confirming that LLMs deliver superior accuracy, adaptability, and transparency in news classification. Future work should investigate multilingual pretraining, multilabel classification, and ensemble techniques to further strengthen real-time, explainable news-analysis pipelines.
Main authors:	Yesilyurt, Burcu
Subject:	Large Language Models; Text Classification; News Classification; Fine-Tuning; Hyperparameter Optimization; BERT; SDG 4 - Quality education; SDG 9 - Industry, innovation and infrastructure; SDG 16 - Peace, justice and strong institutions
Year:	2025
Country:	Portugal
Document type:	master's dissertation
Access type:	embargoed access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Source:	Repositório Institucional da UNL
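
The abstract names a Friedman test followed by Holm's correction for comparing models across cross-validation folds. The sketch below illustrates those two steps in plain Python; it is not the dissertation's code. The function names, the model labels, and the per-fold scores are all hypothetical, and the Friedman statistic is computed without a tie correction. In practice a p-value would be read from a chi-square distribution with k - 1 degrees of freedom (for example via `scipy.stats.friedmanchisquare`).

```python
def average_ranks(scores):
    """Rank one fold's scores: highest score gets rank 1, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of models tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square statistic for a folds-by-models table (no tie correction)."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for model, rank in enumerate(average_ranks(row)):
            rank_sums[model] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for position, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - position) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted


# Hypothetical F1 scores: 5 folds (rows) by 3 placeholder models (columns).
f1_by_fold = [
    [0.95, 0.91, 0.88],
    [0.96, 0.92, 0.87],
    [0.94, 0.90, 0.86],
    [0.97, 0.93, 0.89],
    [0.95, 0.92, 0.88],
]
chi2 = friedman_statistic(f1_by_fold)

# Hypothetical raw p-values from pairwise post-hoc comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

Holm's procedure is a step-down method: each adjusted p-value depends on its position in the sorted list, which makes it uniformly at least as powerful as a plain Bonferroni correction while still controlling the family-wise error rate.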
