Publication
Towards Inclusive Communication. Applying AR Glasses and VLMs for Real-World and Context-Aware Sign Language Translation
| Abstract: | More than 70 million Deaf or Hard-of-Hearing (DHH) individuals use sign language as a means of communication. Still, existing Sign Language Translation (SLT) systems struggle with real-world applicability due to data scarcity, lack of context-awareness, and limited integration with portable, hands-free technologies. This work proposes a novel, gloss-free and context-aware SLT system tailored for real-world scenarios, combining a Vision-Language Model (VLM) and Augmented Reality (AR) glasses to provide accurate translations in everyday situations while also focusing on being efficient and sustainable. In contrast to most recent systems that rely on large-scale resources, this work was developed under strict hardware limits, which motivated the design of a resource-optimized fine-tuning strategy that reduced training costs by about 40%. Similarly, this also led to a focus on targeted and lightweight architectural changes, resulting in the Motion-CLSAdapter, a module that greatly improved temporal motion modeling. Conversational context was incorporated through the creation of a small synthetic dataset and prompting techniques, while cloud-based model deployment and the AR glasses enabled hands-free and interactive use of the application. The results show that these optimizations led to stronger translation metrics, improved clustering of signs, and more coherent dialogue-level translations, while latency remained within the threshold defined for natural interaction. Despite challenges in robustness under real-world capture conditions and error propagation in extended contexts, the prototype demonstrates the feasibility of delivering context-aware SLT through wearable technology. Importantly, it also represents the first work to integrate SLT with AR glasses. Overall, this dissertation provides a solid step towards more inclusive communication, demonstrates that meaningful progress can be achieved even under strict compute limits, and lays a foundation for future systems that support natural communication between signers and non-signers in everyday environments. |
|---|---|
| Main authors: | Arruda, Pedro Guilherme Moreira |
| Subject: | Sign Language Translation; Large Language Models; Vision-Language Models; Augmented Reality; AR Glasses |
| Year: | 2025 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |