Publication

Leveraging Large Language Models (LLMs) to Structure Free-Text Medical Data

Bibliographic details
Abstract:	To address challenges in utilizing free-text radiology reports, this study developed and validated a privacy-preserving synthetic dataset of 1,000 rectal cancer MRI reports to serve as a benchmark for large language models (LLMs). The creation involved a three-stage pipeline: generating structured reports using clinical rules, converting them to narratives with GPT-4o, and using DeepSeek R1 for assessment and correction. Crucially, initial rules were revised based on feedback from an expert radiologist who identified clinical and anatomical implausibilities, such as incorrect tumour localizations and anatomically impossible invasions. To better simulate real-world data, the final dataset was diversified by programmatically introducing missing values, nulls, and measurement inconsistencies, such as converting centimeters to millimeters with up to 40% probability for certain fields. Six LLMs were evaluated on their ability to extract 26 structured fields from the reports, using a constrained-generation framework (Outlines) to ensure syntactically valid JSON output. While the proprietary Gemini 2.0 Flash model served as a strong performance benchmark, the top open-weight models, Phi-4 (8-bit quantized) and the domain-specialized BioMistral-7B-DARE (8-bit quantized), proved to be highly viable alternatives, achieving 85-95% of Gemini's capability. These models even surpassed the benchmark in some areas; for instance, both achieved perfect F1 scores of 1.000 for N-stage classification, compared to Gemini's 0.996. The strong performance of the smaller, 8-bit quantized BioMistral model suggests that domain-specific training can be more effective than simply scaling up general-purpose models for medical tasks.
The most challenging field for all models was "anal sphincter complex invasion," where universally poor performance (F1 scores from 0.181 to 0.303) was attributed to the extraction schema's inability to support the conditional clinical logic required for accurate assessment. This work concludes that quantized open-weight models are clinically acceptable alternatives for automated report structuring, especially where data privacy and cost are priorities, but highlights the need for extraction frameworks to better incorporate complex clinical reasoning.
Main author:	Batrakova, Maria
Subject:	Large Language Models; Artificial Intelligence; Synthetic data; Radiology Reports; Structured Reporting; SDG 3 - Good health and well-being; SDG 4 - Quality education; SDG 9 - Industry, innovation and infrastructure
Year:	2025
Country:	Portugal
Document type:	master's dissertation
Access type:	embargoed access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Source:	Repositório Institucional da UNL
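
Illustrative note: the abstract describes diversifying the dataset by programmatically introducing missing values and unit inconsistencies. The following is a minimal sketch of what such a step might look like, not the dissertation's actual code. The function name `diversify_measurement`, its parameters, and the 10% missing-value rate are hypothetical; only the 40% upper bound for centimeter-to-millimeter conversion comes from the abstract.

```python
import random
from typing import Optional


def diversify_measurement(
    value_cm: Optional[float],
    rng: random.Random,
    p_mm: float = 0.4,       # upper bound stated in the abstract
    p_missing: float = 0.1,  # assumed rate, not from the source
) -> Optional[str]:
    """Render a measurement as report text, sometimes dropping it
    or switching its unit from centimeters to millimeters."""
    # Simulate a missing value (or pass through an already-missing one).
    if value_cm is None or rng.random() < p_missing:
        return None
    # Simulate a unit inconsistency: same quantity, expressed in mm.
    if rng.random() < p_mm:
        return f"{round(value_cm * 10, 1):g} mm"
    return f"{value_cm:g} cm"


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the synthetic data is reproducible
    for length in (3.2, 4.7, None, 1.5):
        print(diversify_measurement(length, rng))
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible, which matters when the resulting dataset is meant to serve as a fixed benchmark.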