Publication
WAVe
| Abstract: | Automatic speech recognition for low-resource languages often relies on synthetic utterances to augment limited speech data; these utterances are generated by pairing large-language-model transcripts with neural text-to-speech (TTS) audio. However, indiscriminate incorporation of synthetic audio can reduce training efficiency and introduce errors such as unnatural speech patterns. We introduce WAVe, a model that verifies word-to-audio frame correspondence by aligning text representations with audio features. To evaluate WAVe, we generated 22k Portuguese and 35k Dutch synthetic audio samples using GPT-4o-mini and a TTS system. We created four training subsets per language with varying proportions of synthetic data and fine-tuned three Whisper models of different sizes. For Portuguese, our high-quality 29k-sample subset achieved a 7.9% word error rate with Whisper-Large-v3, outperforming a recent competitive 55k-sample model trained under identical conditions. Experiments on Dutch showed similarly consistent improvements. WAVe reduces the number of training steps by 34%, substantially lowering computational cost while improving ASR quality. Cross-domain evaluation on the Multilingual LibriSpeech benchmark demonstrates that WAVe-based filtering reduces WER from 13.54% to 6.89%. These results establish WAVe as an effective quality control mechanism for synthetic data pipelines, enabling the identification and removal of poorly synthesized audio-text pairs prior to ASR fine-tuning. |
|---|---|
| Main authors: | Perezhohin, Yuriy |
| Other authors: | Castelli, Mauro |
| Subject: | Automatic speech recognition; Word alignment; Synthetic data; Deep learning; Control and Systems Engineering; Software; Theoretical Computer Science; Computer Science Applications; Information Systems and Management; Artificial Intelligence; SDG 9 - Industry, Innovation, and Infrastructure |
| Year: | 2026 |
| Country: | Portugal |
| Document type: | article |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |