Publication
Unsupervised models for audio emotion detection
| Abstract: | Traditionally, supervised methods learn language models and understand human communication using data-intensive approaches. However, many languages and dialects have few or no resources, which is a major obstacle to the development of Automatic Speech Recognition (ASR) systems. This work seeks to develop a complete unsupervised pipeline to detect emotions from raw audio signals of Via Directa (VD) call center recordings in European Portuguese. To that end, a concise literature review of low-resource approaches for the subtasks of ASR, Speech Enhancement (SE), and Sentiment Analysis (SA) was conducted. For the SE task, a Wave-U-Net model was successfully implemented; it denoises raw audio signals with average Segmental Signal-to-Noise Ratio (SSNR) scores above 3.9 and a 0.8% increase in Signal-to-Noise Ratio (SNR). For the SA task, a domain-specific sentiment lexicon based on the SentiWordNet 3.0 dictionary was developed for European Portuguese. Then, using a linear Support Vector Machine (SVM) baseline model for benchmarking, the Lex2Sent model was modified and its performance improved for binary classification of sentiment in the corresponding transcriptions, achieving an F1 macro score of 0.584. Lastly, limitations are discussed with the goal of developing the remaining unsupervised ASR system. |
|---|---|
| Main authors: | Bernardo, Miguel Ângelo Martins |
| Subject: | NLP; Unsupervised Learning; Sentiment Analysis; Speech Enhancement; Lexicon |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |