Publication

Navigating the mobile app Galaxy: Harnessing textual metadata for app categorization

Bibliographic details
Abstract:	This study presents a comparative analysis of text representation and feature extraction methods for categorizing mobile applications into predefined categories. Effective categorization improves application discoverability, user experience, and the organization of the application ecosystem. To develop an automatic categorization approach, we used Word2Vec, Labeled Latent Dirichlet Allocation (L-LDA), pre-trained language models and RoBERTa to generate numerical semantic representations of the application descriptions. These representations were then used to classify the apps. Our classification system assigned each app to the same category or categories under which it appears on Aptoide, allowing us to evaluate the effectiveness of the methods. Because this is a multi-label classification task, we used Classifier Chains, Label PowerSet, Binary Relevance and Multi-Label Binarizer to handle label dependencies and optimize classification performance. Our dataset, consisting of 9,163 mobile app entries, was obtained through Aptoide's APIs. The results show that the best text representation model, when properly tuned, is RoBERTa, which achieves the highest F1 scores under micro, macro, weighted and samples averaging. It is closely followed by the pre-trained GPT-4o model, which also performs well but falls slightly short. Future research directions include integrating multimodal data, exploring federated learning, adapting to evolving taxonomies, developing interactive and explainable AI systems, conducting cross-language and cross-cultural studies, creating personalized categorization models, assessing ethical implications, integrating with application development lifecycles and using gamification to enhance user engagement.
Main authors:	D'Oliveira, Pedro Afonso Marques
Subject:	Multi-label classification; Mobile application categorization; Text representation models; API data integration
Year:	2024
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	ISCTE
Language:	English
Source:	Repositório ISCTE
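
The abstract reports F1 scores under four averaging schemes (micro, macro, weighted, samples). As a reading aid, the sketch below shows how these averages are commonly defined for multi-label predictions. It is not the dissertation's evaluation code, which this record does not include; the function name `multilabel_f1`, the 0/1 indicator-matrix input format, and the convention of scoring an empty case as 0.0 are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed input format: rows are apps, columns are categories, entries are 0 or 1.
Matrix = List[List[int]]


def _counts(true: List[int], pred: List[int]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); assumed to be 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true: Matrix, y_pred: Matrix) -> Dict[str, float]:
    """Hypothetical helper: F1 under micro, macro, weighted and samples averaging."""
    n_labels = len(y_true[0])

    # Per-category counts, taken column by column.
    per_label = [
        _counts([row[j] for row in y_true], [row[j] for row in y_pred])
        for j in range(n_labels)
    ]
    label_f1 = [_f1(*c) for c in per_label]
    support = [sum(row[j] for row in y_true) for j in range(n_labels)]

    # Micro: pool the counts over all categories, then compute one F1.
    micro = _f1(
        sum(c[0] for c in per_label),
        sum(c[1] for c in per_label),
        sum(c[2] for c in per_label),
    )
    # Macro: unweighted mean of the per-category F1 scores.
    macro = sum(label_f1) / n_labels
    # Weighted: per-category F1 weighted by the number of true instances.
    total = sum(support)
    weighted = sum(f * s for f, s in zip(label_f1, support)) / total if total else 0.0
    # Samples: F1 computed per app (row), then averaged over apps.
    samples = sum(_f1(*_counts(t, p)) for t, p in zip(y_true, y_pred)) / len(y_true)

    return {"micro": micro, "macro": macro, "weighted": weighted, "samples": samples}


if __name__ == "__main__":
    # Toy data: three apps, three categories (made up for illustration).
    y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
    y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
    for name, value in multilabel_f1(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

The four numbers can diverge noticeably when categories are imbalanced: micro averaging is dominated by frequent categories, while macro averaging gives rare categories equal weight, which is presumably why the study reports all of them.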