Publication
RedTreez: A Tree-Based Framework for Conversational Red Teaming
| Abstract: | Large Language Models (LLMs) are increasingly being deployed in safety-critical contexts, making adversarial robustness a central concern. While single-turn jailbreaks have been widely studied, the multi-turn setting, where adversaries exploit sustained dialogue to bypass safeguards, remains underexplored despite posing significant risks. This thesis addresses this gap through the design and evaluation of RedTreez, a structured attack system for planning and executing adaptive multi-turn adversarial strategies. Developed as one of the core attack systems within the broader RedTWIZ framework [7], RedTreez implements a novel tree-based paradigm for organizing and navigating complex dialogue attacks. It formalizes high-level attack trajectories in a dynamic tree structure and introduces pruning and adaptation protocols to make exploration efficient. The framework leverages LLMs for context-aware prompt generation; furthermore, we developed a specialized fine-tuned attacker model that demonstrated an enhanced ability to generate effective, strategy-aligned prompts without relying on the explicit guidance required by larger foundation models. RedTreez was evaluated within the Amazon Trusted AI Challenge, a rigorous benchmark involving state-of-the-art defense models and human-annotated judgments. The system consistently achieved an attack success rate above 80% across all evaluated systems, including Claude 3.5 Sonnet [2]. Our empirical analysis of these results exposes critical multi-turn vulnerabilities, specifically weaknesses in refusal consistency and defense adaptation, demonstrating the robustness of our approach. These findings advance the understanding of multi-turn adversarial dynamics and provide a foundation for building more robust and trustworthy conversational AI systems. |
|---|---|
| Main authors: | Paulo, Iago Miguel do Nascimento |
| Subject: | Large Language Models; Multi-turn Jailbreaking; Adversarial Attacks; Red Teaming; Model Alignment; RedTreez Framework |
| Year: | 2025 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |