Publication

Multi-Turn LLM Red Teaming via Token Steering and RL-Trained Policy

Bibliographic details
Abstract: Large Language Models (LLMs) excel in natural language processing, but remain vulnerable to multi-turn adversarial attacks that exploit contextual dependencies. Traditional red-teaming focuses on single-prompt attacks, missing nuanced, persistent vulnerabilities. Existing defenses struggle to counteract these evolving multi-turn threats, while human-driven red-teaming is resource-intensive and inconsistent. Previous research has explored single-turn jailbreak attacks, adversarial prompting techniques, and defensive strategies such as reinforcement learning from human feedback (RLHF) and fine-tuning. However, these approaches remain limited in addressing dynamic, multi-turn attack strategies. Recent work on automated adversarial testing has highlighted the effectiveness of iterative attack planning, yet a scalable, RL-driven approach remains unexplored. This thesis introduces Red-DAT, a reinforcement learning (RL)-based framework for automated multi-turn adversarial attacks. The approach leverages jailbreak detection feedback to refine attack strategies, combining prefix tuning with reinforcement learning to steer LLMs in multi-turn adversarial dialogues. The framework is integrated into RedTWIZ, a larger red-teaming system developed for the Amazon Trusted AI Challenge, where it was evaluated in competitive tournament settings against state-of-the-art, safety-tuned LLMs. Experiments show that Red-DAT improves attack success and adapts to different defender models, revealing weaknesses in current safety mechanisms. Integrated into RedTWIZ, it contributed to the team’s second-place finish in the international competition. This work advances multi-turn red-teaming as a practical approach to AI safety by systematically uncovering vulnerabilities in LLMs and informing the design of stronger defenses.
Main authors: Paz, Henrique Mestre dos Santos Sacadura
Subject: Large Language Models; Multi-Turn Attacks; Reinforcement Learning; Adversarial AI; Red-Teaming; AI Safety
Year: 2025
Country: Portugal
Document type: master's dissertation
Access type: open access
Associated institution: Universidade Nova de Lisboa
Language: English
Source: Repositório Institucional da UNL