Publication
Multi-Turn LLM Red Teaming via Token Steering and RL-Trained Policy
| Abstract: | Large Language Models (LLMs) excel in natural language processing, but remain vulnerable to multi-turn adversarial attacks that exploit contextual dependencies. Traditional red-teaming focuses on single-prompt attacks, missing nuanced, persistent vulnerabilities. Existing defenses struggle to counteract these evolving multi-turn threats, while human-driven red-teaming is resource-intensive and inconsistent. Previous research has explored single-turn jailbreak attacks, adversarial prompting techniques, and defensive strategies such as reinforcement learning from human feedback (RLHF) and fine-tuning. However, these approaches remain limited in addressing dynamic, multi-turn attack strategies. Recent work on automated adversarial testing has highlighted the effectiveness of iterative attack planning, yet a scalable, RL-driven approach remains unexplored. This thesis introduces Red-DAT, a reinforcement learning (RL)-based framework for automated multi-turn adversarial attacks. The approach leverages jailbreak detection feedback to refine attack strategies, combining prefix tuning with reinforcement learning to steer LLMs in multi-turn adversarial dialogues. The framework is integrated into RedTWIZ, a larger red-teaming system developed for the Amazon Trusted AI Challenge, where it was evaluated in competitive tournament settings against state-of-the-art, safety-tuned LLMs. Experiments show that Red-DAT improves attack success and adapts to different defender models, revealing weaknesses in current safety mechanisms. Integrated into RedTWIZ, it contributed to the team’s second-place finish in the international competition. This work advances multi-turn red-teaming as a practical approach to AI safety by systematically uncovering vulnerabilities in LLMs and informing the design of stronger defenses. |
|---|---|
| Main authors: | Paz, Henrique Mestre dos Santos Sacadura |
| Subject: | Large Language Models; Multi-Turn Attacks; Reinforcement Learning; Adversarial AI; Red-Teaming; AI Safety |
| Year: | 2025 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |