Publication
BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
| Abstract: | This dissertation explores state-of-the-art deep learning models for document retrieval in biomedical research. Its case study is the Exposome-Explorer database, which contains manually curated entries on biomarkers of exposure to environmental risk factors for various diseases. Previous work has employed simple machine learning algorithms to reduce expert workload by improving the accuracy and efficiency of document retrieval. In this dissertation, traditional document retrieval methods, such as BM25, are evaluated alongside transformer models such as MonoBERT, DistilBERT, and PubMedBERT to assess their suitability for the task. Results show that PubMedBERT, which is pre-trained on biomedical text, offers the best performance in retrieving relevant documents, with BM25 contributing significantly to the initial refinement of the dataset. However, challenges persist, including variability in the curated data and in precision and recall, particularly for smaller datasets with fewer training examples, such as pollutant biomarkers. This research represents a step toward automating and refining the curation of biomedical databases, with the aim of faster and more reliable results. Future work will involve applying the trained models to the latest version of the Exposome-Explorer database and enhancing BM25 with RM3 query expansion for improved document ranking. Additional optimization of the models will be explored to address performance variability and improve overall retrieval accuracy across different biomarker datasets. |
|---|---|
| Main authors: | Ramos, Diogo Luís Embaixador |
| Subject: | DEEP LEARNING; DOCUMENT RETRIEVAL; DATABASE CURATION; BIOMEDICAL LITERATURE; INFORMATION RETRIEVAL |
| Year: | 2024 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |