Publication
Evaluating the quality of requirements using Generative AI techniques
| Summary: | High-quality requirements are critical to the success of software projects, yet ambiguity, inconsistency, and incompleteness in software specifications continue to challenge requirements engineering. Manual reviews are often costly and subjective, motivating the search for automated approaches that can complement expert analysis. Recent advances in Generative Artificial Intelligence (GAI) and Large Language Models (LLMs) offer promising capabilities to evaluate natural language requirements and propose refinements. This thesis investigates how LLMs can be systematically applied to assess five quality attributes of requirements specifications and user stories: Consistency, Unambiguousness, Correctness, Feasibility, and Completeness. A multi-phase framework was developed that combines structured prompts, ensemble reasoning, and advanced prompting techniques, such as Retrieval-Augmented Generation (RAG) and Rephrase-and-Respond (RAR). The framework integrates established quality criteria from the literature, linguistic patterns indicative of poor requirements, and context-sensitive evaluation strategies. Its performance was tested across several real-world datasets, including crisis management and automotive crash systems, and validated through experiments with multiple LLMs (GPT, Gemini, and Grok). To further validate the framework, a survey was conducted with 18 professionals and graduate students in computer science and related fields. Participants assessed the same requirements and provided feedback on the clarity, accuracy, and trustworthiness of the framework. Quantitative analysis revealed that LLMs tended to assign slightly higher ratings than humans. The overall evaluation was strong, with a Mean Absolute Error of 0.82. Participants rated the framework positively in terms of clarity and relevance, though they expressed more caution regarding its direct use in critical real-world decision-making. |
|---|---|
| Main Authors: | Fonseca, Vasco Fernandes |
| Subject: | requirements engineering; requirements quality; generative artificial intelligence; large language models; prompt engineering |
| Year: | 2025 |
| Country: | Portugal |
| Document type: | master thesis |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Origin: | Repositório Institucional da UNL |