Publication

Map Text Extraction and Parsing using Optical Character Recognition (OCR) for Facilitating Map Reproducibility Assessment


Bibliographic details
Abstract: Reproducibility is fundamental to transparency and openness in scientific publishing, including geoscientific research. Figures, and maps in particular, visualize and represent key results in geoscientific research, so they should be reproducible. However, assessing whether a map has been successfully reproduced is limited by the absence of standard metrics, criteria, and tools. This study develops a novel web-based application that facilitates map reproducibility assessment based on the textual elements of a map. The tool integrates an open-source optical character recognition (OCR) technology to extract text from maps and proposes a comparative analysis workflow with assessment criteria such as the similarity between the extracted texts, measured with fuzzy string matching techniques; the overlap ratio between the bounding boxes associated with the texts, measured with the Jaccard index (intersection over union); and the Euclidean distance between the bounding boxes. The tool was validated and evaluated using real-world datasets. Compared with existing map comparison methods, it proved effective in terms of accessibility, interoperability, and flexibility in accommodating diverse file sizes, image resolutions, and file types. The tool was found to be usable, with a System Usability Scale (SUS) score of 69.33, and useful for researchers and GIS professionals who need to extract and assess textual elements from maps. In addition, the study reports promising results for accurate OCR text extraction from maps, even at the lowest map image resolution (60 dpi) and smallest font size (7 pt).
Main authors: Mulaw, Yohannes Abrha
Subject: reproducibility; map reproducibility; map reproducibility assessment; optical character recognition (OCR); text analysis; fuzzy matching
Year: 2024
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL
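The abstract names three comparison criteria: fuzzy text similarity, the Jaccard index (intersection over union) of bounding boxes, and the Euclidean distance between bounding boxes. The following is a minimal Python sketch of how such measures could be computed. It is not the dissertation's implementation. It assumes boxes given as (x, y, width, height) in pixels, measures distance between box centers, and uses the standard-library `difflib.SequenceMatcher` as a stand-in for whatever fuzzy matching library the tool actually uses. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from math import hypot

# Assumption: a bounding box is (x, y, width, height) in pixels, with (x, y)
# the top-left corner. The abstract does not specify the box format.


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy similarity ratio in [0, 1] (stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard_index(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def center_distance(box_a, box_b) -> float:
    """Euclidean distance between box centers (assumed reference points)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


if __name__ == "__main__":
    # Hypothetical label from an original map and from its reproduction.
    original = ("Lisbon", (10, 10, 40, 10))
    reproduced = ("Lisboa", (15, 10, 40, 10))
    print(text_similarity(original[0], reproduced[0]))
    print(jaccard_index(original[1], reproduced[1]))
    print(center_distance(original[1], reproduced[1]))
```

A higher text similarity and Jaccard index, together with a smaller center distance, would indicate that a label in the reproduced map closely matches its counterpart in the original. How these values are thresholded or combined is left open here, since the abstract does not say.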