Document details

Large language models overcome the challenges of unstructured text data in ecology

Author(s): Castro, Andry ; Pinto, João ; Reino, Luís ; Pipek, Pavel ; Capinha, César

Date: 2024

Persistent ID: http://hdl.handle.net/10362/172961

Origin: Repositório Institucional da UNL

Subject(s): AI; Automation; Data integration; GPT; LLaMA; Unstructured data; Ecology, Evolution, Behavior and Systematics; Ecology; Modelling and Simulation; Ecological Modelling; Computer Science Applications; Computational Theory and Mathematics; Applied Mathematics


Description

Funding Information: AC was supported by a grant (PRT/BD/152100/2021) financed by the Portuguese Foundation for Science and Technology (FCT) under the MIT Portugal Program. AC and CC acknowledge support from FCT through support to the CEG/IGOT Research Unit (UIDB/00295/2020 and UIDP/00295/2020). JP was funded by FCT through funds to GHTM (UID/04413/2020). LR was funded through the FCT contract ‘CEECIND/00445/2017’ under the ‘Stimulus of Scientific Employment, Individual Support’ and by the FCT ‘UNRAVEL’ project (PTDC/BIA-ECO/0207/2020; https://doi.org/10.54499/PTDC/BIA-ECO/0207/2020). PP acknowledges support from the Czech Science Foundation (project no. 23-07278S). Publisher Copyright: © 2024 The Authors

The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87–100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81–97%), whereas LLaMA-2-70B showed the worst performance (37–73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.

Document Type: Journal article
Language: English
Contributor(s): Vector borne diseases and pathogens (VBD); Instituto de Higiene e Medicina Tropical (IHMT); Global Health and Tropical Medicine (GHTM); RUN
