Publication

Comparing Multimodal LLMS and Traditional Neural Networks for Table Extraction From PDFs and Images: An Evaluation of Structure and Content Extraction from Table in Images

Bibliographic details
Abstract:	Extracting tables from images, such as cropped sections from PDFs or screenshots of spreadsheets, remains a challenging task due to the variability in table layouts and the absence of structural metadata. Traditional OCR-based systems, like the Table Transformer (TATR) combined with PaddleOCR, rely on explicit structure detection and text recognition. More recently, multimodal Large Language Models (LLMs) such as GPT-4o, GPT-4o Mini, Granite Vision, and PHI-3 Vision have introduced an alternative approach, generating structured outputs directly from images without relying on traditional OCR pipelines. This thesis compares both strategies using 2,000 annotated tables from the PubTables-1M dataset, evenly split between simple and complex cases. Evaluation focuses on structural accuracy, content fidelity, and layout robustness, with GriTSCon used as a unified metric. Results show that GPT-4o performs best among multimodal LLMs on simple tables (GriTSCon F1 = 89.6%), while TATR-OCR outperforms all models on complex tables (GriTSCon F1 = 85.5%). GPT-4o achieves higher cell-content accuracy at exact-match thresholds on simple layouts but experiences a performance drop of 17 points when handling complex structures. In contrast, TATR-OCR maintains high accuracy across both scenarios, with low failure rates and stable structure recognition. These findings highlight the limitations of current multimodal LLMs in complex visual tasks and support the potential of hybrid approaches that combine the strengths of OCR-based systems with LLM reasoning capabilities.
Main authors:	Nunes, Guilherme Guerra Marques
Subjects:	Table Extraction; Optical Character Recognition; Multimodal Large Language Models; Table Transformer; GriTSCon; SDG 4 - Quality education; SDG 9 - Industry, innovation and infrastructure; SDG 16 - Peace, justice and strong institutions; SDG 17 - Partnerships for the goals
Year:	2025
Country:	Portugal
Document type:	master's thesis
Access type:	open access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Source:	Repositório Institucional da UNL
