Publication

Optimizing document reranking in a retrieval-augmented generation pipeline for Portuguese legal research


Bibliographic details
Abstract: This study explores RAG systems tailored to the Portuguese legal domain, highlighting challenges in underrepresented languages. Fixed-size chunking strategies, particularly TokenTextSplitter, were found to be the most effective, while more advanced techniques such as Recursive and Semantic splitting showed little benefit. Larger chunk sizes improved retrieval accuracy and answer quality, though the impact of chunk overlap remains inconclusive. Although reranking techniques have been shown to improve retrieval in previous research, this may only be true for large and diverse datasets.
Main authors: Wollny, Carolyn Svea
Subject: Retrieval-Augmented Generation; RAG; Large Language Models; LLM; Artificial Intelligence; AI; Hallucination; Question answering; RAG evaluation; Vector store; Chunking; Legal AI; Document reranking; Relevance ranking; Legal information retrieval; Portuguese legal retrieval
Year: 2025
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL