Publication
Assessing the quality and reliability of ChatGPT’s responses to radiotherapy-related patient queries: GPT-3.5 versus GPT-4
| Abstract: | Patients frequently turn to the Internet for cancer information. However, these websites often lack content accuracy and readability. ChatGPT, an artificial intelligence-powered chatbot, represents a potential paradigm shift in how cancer patients access medical information. Yet, given that ChatGPT was not explicitly trained for oncology-related inquiries, the quality of the information it provides must still be verified. Evaluating response quality is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and delay appropriate treatment. Objective: This study aims to evaluate the quality and reliability of ChatGPT’s responses to standard patient queries about radiotherapy, comparing the performance of GPT-3.5 and GPT-4. Methods: Forty commonly asked radiotherapy questions were selected and submitted to both versions. Responses were evaluated by six radiotherapy experts using a General Quality Score (GQS), assessed for consistency and similarity using the cosine similarity score, and analyzed for readability using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL). Statistical analysis was performed using the Mann-Whitney test. Results: GPT-4 demonstrated superior performance, with higher GQS values and no low scores, unlike GPT-3.5. The Mann-Whitney test revealed statistically significant differences for some questions, with GPT-4 generally receiving higher ratings. The cosine similarity score indicated substantial similarity and consistency between the responses of the two versions. Both versions produced college-level text, with GPT-4 scoring slightly better on the FRES (35.55 vs. 30.68) and the FKGL (12.71 vs. 13.53); responses from both were considered difficult for the general public to read. Conclusions: While GPT-4 generates more accurate and reliable responses than GPT-3.5, both models present readability challenges for the general public. ChatGPT shows potential as a valuable resource for addressing common patient queries about radiotherapy, but its limitations, including the risk of misinformation and readability issues, must be acknowledged. |
|---|---|
| Main Author: | Grilo, Ana |
| Other Authors: | Marques, Catarina; Corte-Real, Maria; Carolino, Elisabete; Caetano, Marco |
| Subjects: | Radiotherapy; Artificial intelligence; ChatGPT; Large language model; Patient information |
| Year: | 2024 |
| Country: | Portugal |
| Document Type: | Preprint |
| Access Type: | Open access |
| Associated Institution: | Instituto Politécnico de Lisboa |
| Language: | English |
| Source: | Repositório Científico do Instituto Politécnico de Lisboa |
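
The Methods name two readability formulas, FRES and FKGL, without reproducing them. The sketch below shows both standard Flesch formulas in Python; the crude vowel-group syllable counter and the sample `answer` string are illustrative assumptions, not the study's tooling (published implementations typically use dictionary-based syllable counts).

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group syllable count; real tools use pronunciation dictionaries."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1  # drop a typical silent final 'e'
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) from the standard Flesch formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # words per sentence
    spw = syllables / len(words)        # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl

# Placeholder answer text, invented for illustration only.
answer = ("Radiotherapy uses carefully targeted doses of radiation "
          "to destroy cancer cells while sparing healthy tissue.")
fres, fkgl = readability(answer)
print(f"FRES: {fres:.2f}  FKGL: {fkgl:.2f}")
```

Lower FRES and higher FKGL both indicate harder text; the reported scores (FRES around 30-36, FKGL around 13) correspond to college-level reading.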
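The consistency and significance checks in the Methods can be sketched the same way. Below, cosine similarity is computed over TF-IDF vectors (the preprint does not state how responses were vectorized, so TF-IDF is an assumption) and a Mann-Whitney U test compares six-expert GQS ratings; all strings and ratings are invented placeholders, not study data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import mannwhitneyu

# Hypothetical paired responses to the same question from each model.
gpt35_answer = "Radiotherapy treats cancer with high-energy radiation."
gpt4_answer = "Radiotherapy uses high-energy radiation to treat cancer."

# Cosine similarity = (A . B) / (||A|| * ||B||), here over TF-IDF vectors.
tfidf = TfidfVectorizer().fit_transform([gpt35_answer, gpt4_answer])
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"cosine similarity: {similarity:.3f}")

# Hypothetical GQS ratings (1-5) for one question, six raters per model.
gqs_gpt35 = [3, 4, 3, 4, 3, 4]
gqs_gpt4 = [4, 5, 4, 5, 5, 4]
stat, p = mannwhitneyu(gqs_gpt35, gqs_gpt4, alternative="two-sided")
print(f"Mann-Whitney U = {stat}, p = {p:.4f}")
```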