Publication

SLVideo: A Sign Language Video Moment Retrieval Framework


Bibliographic details
Abstract: Sign Language Recognition has been an increasingly studied and developed subject throughout the years to help deaf and hard-of-hearing individuals in their social interactions in everyday life. These technologies employ manual sign recognition algorithms; however, the majority of them lack the capacity to recognise facial expressions, which are also an essential part of sign language as they allow the speaker to add expressiveness to their dialogue or even change the meaning of certain manual signs. For Portuguese Sign Language Recognition software this is no exception. This dissertation introduces SLVideo, a video moment retrieval system for Sign Language videos that incorporates facial expressions, addressing the gap in existing technology by focusing on both hand and facial signs. The system extracts embedding representations for the hand and face signs from video frames to capture the language signs in their entirety. This enables users to search for a specific sign language video segment with text queries or to search by similar sign language videos. To evaluate this system, a collection of eight hours of annotated Portuguese Sign Language videos is used as the dataset, and a CLIP model is used to generate the embeddings. The initial results are promising in a zero-shot setting. Additionally, SLVideo allows users to edit existing annotations and create new ones, making it a collaborative tool for annotators working with the same videos.
Main authors: Martins, Gonçalo Vinagre
Subject: Sign Language Recognition; Facial expressions; Portuguese Sign Language; Video moment retrieval
Year: 2024
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL