Publication

COMPRESSED LEARNING FOR TEXT CATEGORIZATION

Bibliographic details
Abstract:	In text classification based on the bag-of-words (BoW) or similar representations, we usually have a large number of features, many of which are irrelevant (or even detrimental) for classification tasks. Recent results show that compressed learning (CL), i.e., learning in a domain of reduced dimensionality obtained by random projections (RP), is possible, and theoretical bounds on the test set error rate have been shown. In this work, we assess the performance of CL, based on RP of BoW representations, for text classification. Our experimental results show that CL significantly reduces the number of features and the training time, while simultaneously improving the classification accuracy. Rather than the mild decrease in accuracy upper-bounded by the theory, we actually find an increase in accuracy. Our approach is further compared against two techniques, namely the unsupervised random subspaces method and the supervised Fisher index. The CL approach is suited for unsupervised or semi-supervised learning, without any modification, since it does not use the class labels.
Main authors:	Ferreira, Artur
Other authors:	Figueiredo, Mario
Subject:	Computers; Machine Learning; random projections, random subspaces, compressed learning, text classification, support vector machines
Year:	2013
Country:	Portugal
Document type:	article
Access type:	unknown
Associated institution:	Instituto Superior de Engenharia de Lisboa
Language:	English
Source:	i-ETC : ISEL Academic Journal of Electronics Telecommunications and Computers
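
The compressed learning idea summarized in the abstract (project BoW vectors to a lower dimension with a random matrix, then train a classifier in that domain) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say how the projection matrix is drawn or which classifier settings were used, so the Gaussian entries, the toy corpus, and every function name below are assumptions made for illustration. The subject keywords mention support vector machines; a nearest-centroid classifier is swapped in here only to keep the example dependency-free.

```python
import math
import random
from collections import Counter


def build_vocab(docs):
    """Map each distinct token in the corpus to a column index."""
    tokens = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(tokens)}


def bow(doc, vocab):
    """Term-count (bag-of-words) vector; out-of-vocabulary tokens are ignored."""
    vec = [0.0] * len(vocab)
    for w, c in Counter(doc.lower().split()).items():
        if w in vocab:
            vec[vocab[w]] = float(c)
    return vec


def random_projection(k, d, seed=0):
    """k x d matrix of i.i.d. Gaussian entries with variance 1/k.

    Assumption: Gaussian entries are one common choice; no class labels are used.
    """
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]


def project(R, x):
    """Compress a d-dimensional vector x to k dimensions: y = R x."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]


def fit_centroids(X, y):
    """Stand-in classifier: one mean vector per class in the compressed domain."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda c: math.dist(centroids[c], x))


# Hypothetical toy corpus, used only to exercise the functions above.
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the new processor runs software faster",
    "the compiler optimizes the software code",
]
labels = ["sports", "sports", "tech", "tech"]

vocab = build_vocab(docs)
R = random_projection(k=4, d=len(vocab), seed=0)  # k < d: fewer features
X_compressed = [project(R, bow(doc, vocab)) for doc in docs]
centroids = fit_centroids(X_compressed, labels)

query = project(R, bow("the team scored a goal", vocab))
print(predict(centroids, query))
```

Because the projection matrix is built without looking at `labels`, the same `R` could be applied to unlabeled documents, which is the property the abstract points to for unsupervised or semi-supervised use.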
