Publication

WAVe

Bibliographic details
Abstract:	Automatic speech recognition for low-resource languages often relies on synthetic utterances to augment limited speech data; these utterances are generated by pairing large-language-model transcripts with neural text-to-speech (TTS) audio. However, indiscriminate incorporation of synthetic audio can reduce training efficiency and introduce errors such as unnatural speech patterns. We introduce WAVe, a model that verifies word-to-audio frame correspondence by aligning text representations with audio features. To evaluate WAVe, we generated 22k Portuguese and 35k Dutch synthetic audio samples using GPT-4o-mini and a TTS system. We created four training subsets per language with varying proportions of synthetic data and fine-tuned three Whisper models of different sizes. For Portuguese, our high-quality 29k-sample subset achieved a 7.9% word error rate with Whisper-Large-v3, outperforming a recent competitive 55k-sample model trained under identical conditions. Experiments on Dutch showed similarly consistent improvements. WAVe reduces the number of training steps by 34%, substantially lowering computational cost while improving ASR quality. Cross-domain evaluation on the Multilingual LibriSpeech benchmark demonstrates that WAVe-based filtering reduces WER from 13.54% to 6.89%. These results establish WAVe as an effective quality control mechanism for synthetic data pipelines, enabling the identification and removal of poorly synthesized audio-text pairs prior to ASR fine-tuning.
Main authors:	Perezhohin, Yuriy
Other authors:	Castelli, Mauro
Subject:	Automatic speech recognition; Word alignment; Synthetic data; Deep learning; Control and Systems Engineering; Software; Theoretical Computer Science; Computer Science Applications; Information Systems and Management; Artificial Intelligence; SDG 9 - Industry, Innovation, and Infrastructure
Year:	2026
Country:	Portugal
Document type:	article
Access type:	open access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Source:	Repositório Institucional da UNL
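
The abstract reports its results as word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the record does not describe the authors' evaluation code or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, with no
    casing or punctuation normalization (an assumption, not the paper's setup).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.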
