Publication

WAVe

Bibliographic details
Abstract: Automatic speech recognition for low-resource languages often relies on synthetic utterances to augment limited speech data; these utterances are generated by pairing large-language-model transcripts with neural text-to-speech (TTS) audio. However, indiscriminate incorporation of synthetic audio can reduce training efficiency and introduce errors such as unnatural speech patterns. We introduce WAVe, a model that verifies word-to-audio frame correspondence by aligning text representations with audio features. To evaluate WAVe, we generated 22k Portuguese and 35k Dutch synthetic audio samples using GPT-4o-mini and a TTS system. We created four training subsets per language with varying proportions of synthetic data and fine-tuned three Whisper models of different sizes. For Portuguese, our high-quality 29k-sample subset achieved a 7.9% word error rate with Whisper-Large-v3, outperforming a recent competitive 55k-sample model trained under identical conditions. Experiments on Dutch showed similarly consistent improvements. WAVe reduces the number of training steps by 34%, substantially lowering computational cost while improving ASR quality. Cross-domain evaluation on the Multilingual LibriSpeech benchmark demonstrates that WAVe-based filtering reduces WER from 13.54% to 6.89%. These results establish WAVe as an effective quality control mechanism for synthetic data pipelines, enabling the identification and removal of poorly synthesized audio-text pairs prior to ASR fine-tuning.
Main authors: Perezhohin, Yuriy
Other authors: Castelli, Mauro
Subject: Automatic speech recognition; Word alignment; Synthetic data; Deep learning; Control and Systems Engineering; Software; Theoretical Computer Science; Computer Science Applications; Information Systems and Management; Artificial Intelligence; SDG 9 - Industry, Innovation, and Infrastructure
Year: 2026
Country: Portugal
Document type: article
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL
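Note on the metric: the abstract reports results as word error rate (WER). The sketch below shows the standard definition of WER (word-level edit distance divided by the number of reference words) and the relative reduction implied by the two cross-domain figures quoted in the abstract. It is an illustration only, not the authors' evaluation code; the function names `word_error_rate` and `relative_reduction` are hypothetical, and the text normalization the authors applied before scoring is not stated in this record, so none is applied here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy example: one substitution and one deletion over four reference words.
    print(word_error_rate("a b c d", "a x c"))
    # Figures quoted in the abstract (WER in percent, before and after filtering).
    print(round(relative_reduction(13.54, 6.89), 3))
```

Under this definition, the drop from 13.54% to 6.89% is an absolute reduction of 6.65 percentage points, or roughly a 49% relative reduction.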