Publication
Topic Modeling of Multilingual Customer Reviews: A Study in the Running Footwear Domain
| Abstract: | This thesis investigates the use of multilingual topic modeling for analyzing short online reviews about running shoes. A custom dataset of approximately 30,000 reviews written in Italian, English, and French was collected through web scraping. The aim was to extract recurring themes from these reviews using BERTopic, a neural topic modeling technique based on sentence embeddings, UMAP, HDBSCAN, and class-based TF-IDF. The analysis followed the CRISP-DM framework and involved multiple iterations of preprocessing, modeling, and evaluation. Given the absence of labeled data, sentiment was approximated using the star rating provided by users. An initial attempt to apply Aspect-Based Sentiment Analysis (ABSA) was discarded due to the lack of annotated data and unsatisfactory early results. The multilingual version of BERTopic successfully revealed interpretable themes such as fit, cushioning, durability, and performance. Nevertheless, several limitations emerged. Many reviews were extremely short or generic, reducing topic coherence. Language imbalance introduced biases in topic frequency, and limited computing power constrained the scope of experimentation. Despite these constraints, the results demonstrate the potential of multilingual topic modeling as a scalable and language-flexible approach to extracting actionable insights from unstructured customer feedback. Future research may improve these outcomes by employing larger transformer-based models, developing better preprocessing for short texts, and incorporating human validation to enhance topic interpretability. |
|---|---|
| Main authors: | Bovenga, Giulia |
| Subject: | Topic Modeling; BERTopic; Multilingual Texts; Running Shoes; Natural Language Processing |
| Year: | 2025 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | embargoed access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |