Publication
Navigating the mobile app Galaxy: Harnessing textual metadata for app categorization
| Abstract: | This study conducts a comparative analysis of text representation and feature extraction methods for categorizing mobile applications into predefined categories. Effective categorization improves application discoverability, user experience, and the organization of the application ecosystem. To build an automatic categorizer, we used Word2Vec, Labeled Latent Dirichlet Allocation (L-LDA), pre-trained language models, and RoBERTa to generate numerical semantic representations of the application descriptions. These representations were then used to classify the apps. The target labels for each app were the category or categories under which it is listed on Aptoide, which allowed us to evaluate the effectiveness of the methods. Since this is a multi-label classification problem, we used Classifier Chains, Label PowerSet, Binary Relevance, and Multi-Label Binarizer to handle label dependencies and optimize classification performance. Our dataset of 9,163 mobile apps was obtained through Aptoide's APIs. The results show that the best text representation model, when properly tuned, is RoBERTa, which achieves the highest F1 scores under micro, macro, weighted, and samples averaging. It is closely followed by the pre-trained GPT-4o model, which also performs well but falls slightly short. Future research directions include integrating multimodal data, exploring federated learning, adapting to evolving taxonomies, developing interactive and explainable AI systems, conducting cross-language and cross-cultural studies, creating personalized categorization models, assessing ethical implications, integrating with application development lifecycles, and using gamification to enhance user engagement. |
|---|---|
| Main authors: | D'Oliveira, Pedro Afonso Marques |
| Subject: | Multi-label classification; Mobile application categorization; Text representation models; API data integration |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | ISCTE |
| Language: | English |
| Source: | Repositório ISCTE |
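
The abstract compares models by F1 score under four averaging schemes (micro, macro, weighted, and samples). As a rough illustration of how these averages differ in a multi-label setting, the sketch below computes them in plain Python. It is not the dissertation's evaluation code: the function names `f1` and `multilabel_f1`, the category names, and the toy predictions are all hypothetical, and an undefined F1 (no true or predicted labels) is assumed to count as 0.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true: List[Set[str]], y_pred: List[Set[str]]) -> Dict[str, float]:
    """Compute micro, macro, weighted and samples F1 for sets of labels."""
    labels = sorted(set().union(*y_true, *y_pred))

    # Per-label confusion counts: (true positives, false positives, false negatives).
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        counts[label] = (tp, fp, fn)

    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))

    # Macro: unweighted mean of the per-label F1 scores.
    per_label = {label: f1(*c) for label, c in counts.items()}
    macro = sum(per_label.values()) / len(labels) if labels else 0.0

    # Weighted: per-label F1 weighted by support (number of true occurrences).
    support = {label: c[0] + c[2] for label, c in counts.items()}
    total = sum(support.values())
    weighted = sum(per_label[l] * support[l] for l in labels) / total if total else 0.0

    # Samples: F1 computed per app over its label set, then averaged over apps.
    per_sample = [f1(len(t & p), len(p - t), len(t - p)) for t, p in zip(y_true, y_pred)]
    samples = sum(per_sample) / len(per_sample) if per_sample else 0.0

    return {"micro": micro, "macro": macro, "weighted": weighted, "samples": samples}


# Hypothetical toy data: three apps, each with one or two store categories.
y_true = [{"games"}, {"tools", "productivity"}, {"games", "social"}]
y_pred = [{"games"}, {"tools"}, {"social", "tools"}]

scores = multilabel_f1(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the macro average is pulled down by the one category that is never predicted, while the samples average rewards apps whose label sets are mostly right, which is why reporting all four gives a fuller picture than any single score.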