Publication

Evaluating the quality of requirements using Generative AI techniques

Bibliographic details
Abstract: High-quality requirements are critical to the success of software projects, yet ambiguity, inconsistency, and incompleteness of software specifications continue to challenge requirements engineering. Manual reviews are often costly and subjective, motivating the search for automated approaches that can complement expert analysis. Recent advances in Generative Artificial Intelligence (GAI) and Large Language Models (LLMs) offer promising capabilities to evaluate natural language requirements and propose refinements. This thesis investigates how LLMs can be systematically applied to assess the quality attributes of requirements specifications and user stories, namely Consistency, Unambiguousness, Correctness, Feasibility, and Completeness. A multi-phase framework was developed that combines structured prompts, ensemble reasoning, and advanced prompting techniques, such as Retrieval-Augmented Generation (RAG) and Rephrase-and-Respond (RAR). The framework integrates established quality criteria from the literature, linguistic patterns indicative of poor requirements, and context-sensitive evaluation strategies. Its performance was tested across several real-world datasets, including crisis management and automotive crash systems, and validated through experiments with multiple LLMs (GPT, Gemini, and Grok). To further validate the framework, a survey was conducted with 18 professionals and graduate students in computer science and related fields. Participants assessed the same requirements and provided feedback on the clarity, accuracy, and trustworthiness of the framework. Quantitative analysis revealed that LLMs tended to assign slightly higher ratings than humans. The overall evaluation was strong, with a Mean Absolute Error of 0.82. Participants rated the framework positively in terms of clarity and relevance, though they expressed more caution regarding its direct use in critical real-world decision-making.
Main authors: Fonseca, Vasco Fernandes
Subject: requirements engineering; requirements quality; generative artificial intelligence; large language models; prompt engineering
Year: 2025
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL