Publication

RedTreez: A Tree-Based Framework for Conversational Red Teaming

Bibliographic details
Abstract: Large Language Models (LLMs) are increasingly being deployed in safety-critical contexts, making adversarial robustness a central concern. While single-turn jailbreaks have been widely studied, the multi-turn setting, where adversaries exploit sustained dialogue to bypass safeguards, remains underexplored despite posing significant risks. This thesis addresses this gap through the design and evaluation of RedTreez, a structured attack system for planning and executing adaptive multi-turn adversarial strategies. Developed as one of the core attack systems within the broader RedTWIZ framework [7], RedTreez implements a novel tree-based paradigm for organizing and navigating complex dialogue attacks. It formalizes high-level attack trajectories in a dynamic tree structure and introduces pruning and adaptation protocols to make exploration efficient. The framework leverages LLMs for context-aware prompt generation; furthermore, we developed a specialized fine-tuned attacker model that demonstrated an enhanced ability to generate effective, strategy-aligned prompts without relying on the explicit guidance required by larger foundation models. RedTreez was evaluated within the Amazon Trusted AI Challenge, a rigorous benchmark involving state-of-the-art defense models and human-annotated judgments. The system consistently achieved an attack success rate of over 80% across all evaluated systems, including Claude 3.5 Sonnet [2]. Our empirical analysis of these results exposes critical multi-turn vulnerabilities, specifically weaknesses in refusal consistency and defense adaptation, demonstrating the robustness of our approach. These findings advance the understanding of multi-turn adversarial dynamics and provide a foundation for building more robust and trustworthy conversational AI systems.
Main authors: Paulo, Iago Miguel do Nascimento
Subject: Large Language Models; Multi-turn Jailbreaking; Adversarial Attacks; Red Teaming; Model Alignment; RedTreez Framework
Year: 2025
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL