Publication

Capturing the narrative : deep learning models for comics sequences

Bibliographic details
Abstract:	Comics represent the complex way humans can communicate and expose ideas, which poses additional challenges for image-to-text deep learning models. In this project, we investigate how multimodal deep learning architectures perform in describing a comics vignette. We investigate how well current state-of-the-art models (GIT and BLIP-2) describe the narrative in four-image comics sequences from a dataset we created. We find that some prompting can produce acceptable results. We also assess how to propagate information across a sequence's images by adding to each prompt the outputs previously generated for the earlier images of the same sequence. The results show limited improvements from this strategy. While the overall meaning of the predicted descriptions is close to the semantic space of the real descriptions, they are still far from human-level descriptions. Therefore, we propose several future experiments, among which we highlight using reinforcement learning to train a large language model as a policy function for prompt generation.
Main authors:	Marouvo, Gonçalo Ventura Lourenço
Subject:	Comics; Computer vision; Image captioning; Multimodal Deep Learning Models; Prompt engineering
Year:	2025
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Instituto Politécnico de Coimbra
Language:	English
Source:	Instituto Politécnico de Coimbra
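
The context-propagation strategy mentioned in the abstract (adding the previous outputs of a sequence to the prompt for the next image) might look roughly like the sketch below. This is only an illustration under stated assumptions, not the dissertation's implementation: `describe_sequence`, `caption_fn`, the `"Previously:"` prefix, and the default prompt text are all hypothetical names and wording, and `caption_fn` stands in for whatever wrapper calls a captioning model such as GIT or BLIP-2.

```python
from typing import Callable, List


def describe_sequence(
    panels: List[str],
    caption_fn: Callable[[str, str], str],
    base_prompt: str = "Describe what happens in this comic panel.",
) -> List[str]:
    """Describe each panel in order, feeding earlier outputs into later prompts.

    `panels` holds panel identifiers (for example file paths). `caption_fn`
    is a hypothetical callable taking (panel, prompt) and returning a text
    description; it is an assumption of this sketch, not a real library API.
    """
    descriptions: List[str] = []
    for panel in panels:
        if descriptions:
            # Prepend every description generated so far for this sequence.
            context = " ".join(descriptions)
            prompt = f"Previously: {context} {base_prompt}"
        else:
            # The first panel has no earlier context to propagate.
            prompt = base_prompt
        descriptions.append(caption_fn(panel, prompt))
    return descriptions
```

With a four-image sequence, the first call receives only the base prompt, while the fourth call receives the three earlier descriptions followed by the base prompt. The abstract reports only limited improvements from this kind of propagation.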
