Publication
Multi-Turn LLM Red Teaming via Token Steering and RL-Trained Policy
| Abstract: | Large Language Models (LLMs) excel in natural language processing, but remain vulnerable to multi-turn adversarial attacks that exploit contextual dependencies. Traditional red-teaming focuses on single-prompt attacks, missing nuanced, persistent vulnerabilities. Existing defenses struggle to counteract these evolving multi-turn threats, while human-driven red-teaming is resource-intensive and inconsistent. Previous research has explored single-turn jailbreak attacks, adversarial prompting techniques, and defensive strategies such as reinforcement learning from human feedback (RLHF) and fine-tuning. However, these approaches remain limited in addressing dynamic, multi-turn attack strategies. Recent work on automated adversarial testing has highlighted the effectiveness of iterative attack planning, yet a scalable, RL-driven approach remains unexplored. This thesis introduces Red-DAT, a reinforcement learning (RL)-based framework for automated multi-turn adversarial attacks. The approach leverages jailbreak detection feedback to refine attack strategies, combining prefix tuning with reinforcement learning to steer LLMs in multi-turn adversarial dialogues. The framework is integrated into RedTWIZ, a larger red-teaming system developed for the Amazon Trusted AI Challenge, where it was evaluated in competitive tournament settings against state-of-the-art, safety-tuned LLMs. Experiments show that Red-DAT improves attack success and adapts to different defender models, revealing weaknesses in current safety mechanisms. Integrated into RedTWIZ, it contributed to the team’s second-place finish in the international competition. This work advances multi-turn red-teaming as a practical approach to AI safety by systematically uncovering vulnerabilities in LLMs and informing the design of stronger defenses. |
|---|---|
| Main authors: | Paz, Henrique Mestre dos Santos Sacadura |
| Subject: | Large Language Models; Multi-Turn Attacks; Reinforcement Learning; Adversarial AI; Red-Teaming; AI Safety |
| Year: | 2025 |
| Country: | Portugal |
| Document type: | master's dissertation |
| Access type: | open access |
| Associated institution: | Universidade Nova de Lisboa |
| Language: | English |
| Source: | Repositório Institucional da UNL |
Related records
RedTreez: A Tree-Based Framework for Conversational Red Teaming
by: Paulo, Iago Miguel do Nascimento
Published: (2025)
RedTWIZ Guard: Adversarial Jailbreak Detection in Multi-Turn Conversations
by: Pina, Daniel Lopes
Published: (2025)
Adaptive Attack Planning for Red-Teaming Large Language Models
by: Horal, Artur
Published: (2025)
An LLM-Based Conversational Agent for Multi-Operator Public Transport Information
by: Berthele, Louis Jacob
Published: (2026)
Operationalising Intelligent Augmentation: A multi-layer analysis of LLM use in organizations
by: Andrade, Miguel Filipe Jacinto
Published: (2026)
Cyber Red team bot with RL
by: Neto, João Pires Ferreira
Published: (2025)
Defending the defender: adversarial learning based defending strategy for learning based security methods in Cyber-Physical Systems (CPS)
by: Sheikh, Zakir Ahmad
Published: (2023)
Tourists and artificial intelligence-LLM interaction: The power of forgiveness
by: Loureiro, S. M. C.
Published: (2025)
Adaptative Perturbation Patterns: Realistic Adversarial Learning for Robust Intrusion Detection
by: Vitorino, João
Published: (2022)
Navegação turn-by-turn em Android
by: Henriques, Luís Miguel dos Santos
Published: (2019)
GENERATIVE AI MUTABILITY IN CYBERSECURITY: A BIBLIOMETRIC REVIEW
by: Oliveira, Pedro
Published: (2026)
Constrained adversarial learning for automated software testing: a literature review
by: Vitorino, João
Published: (2025)
Generative AI for growth hacking: how startups use generative AI in their growth strategies
by: Rezazadeh, Arash
Published: (2025)
Attack Detection in Cyber-Physical Production Systems using the Deterministic Dendritic Cell Algorithm
by: Pinto, Rui
Published: (2020)
Towards Adversarial Realism and Robust Learning for IoT Intrusion Detection and Classification
by: Vitorino, João
Published: (2023)
Do we need to trust our ai-teammate to perform? Investigating the mediating role of team trust in the relationship between team composition and team performance
by: Friedrich, Lilian Marie
Published: (2023)
Generative AI for growth hacking
by: Rezazadeh, Arash
Published: (2025)
Deep Learning Based Communication: an Adversarial Approach
by: Emami, Yousef
Published: (2019)
Pivotal or peripheral: assessing the role of generative artificial intelligence in accelerating entrepreneurial success - a study of enhancing software engineering
by: Bakakis, Andreas
Published: (2024)
Simplifying complex insurance product management with AI
by: Teixeira, João
Published: (2025)
Shaping the future of technical education in healthcare: development of a new biomed certification program. What customers need and beyond
by: Pilz, Nils Jonathan
Published: (2024)
Adversarial Attacks to Classification Systems
by: Leal, João Miguel Gouveia
Published: (2022)
A Genetic Algorithm Framework for Jailbreaking Large Language Models [poster]
by: Bonin, Lorenzo
Published: (2025)
Motion planning using DeepRL, applied to an industrial forklift pallet picking problem
by: Godinho, Francisco José Pessoa
Published: (2025)
Navigating the Risks and Safeguards of Large Language Models (LLMs): Addressing Data Privacy, Security, and Ethical Concerns
by: Jarząbkowski, Mikołaj
Published: (2025)
Constrained Adversarial Learning and its applicability to Automated Software Testing: a systematic review
by: Vitorino, João
Published: (2023)
Artificial intelligence in recruitment: a multivocal review of benefits, challenges, and strategies
by: Trovão, Hugo
Published: (2025)
Implications of causality in artificial intelligence
by: Cavique, Luís
Published: (2024)
Navigating the AI-powered workplace: generative AI and generation Z in service organizations-What are the challenges and opportunities of enabling collaboration between generation Z employees and generative AI tools in service organizations?
by: Beumer, Jan-Sina
Published: (2026)
GENERATIVE AI FOR ENTERPRISE ARCHITECTURE SUPPORT: A CHATBOT SOLUTION FOR IT QUERIES
by: Anjo, Lucas Matias Neto
Published: (2025)
Creativity support tool for sustainability: an LLM-Assisted Creative Process
by: Rodrigues, Ana Isabel Mendonça
Published: (2026)
LLM-based cost-aware task scheduling for cloud computing systems
by: Pei, Haoran
Published: (2025)
Exploring Trust and Literacy in Engagement With Generative AI and Science Information Behavior
by: Agergaard, Torben E.
Published: (2026)
Online dating solutions and turning points motivations in Portugal
by: Sepúlveda, Rita
Published: (2020)
Analyzing the significance of AI in digital marketing teams: a case study on Gocomo and the AI assistant Alfred
by: Ivankovic, Antonela
Published: (2024)
CodeAssert: Multi-Provider LLM Evaluation for Automated Unit Test Generation
by: GONÇALVES, JORGE ANDRÉ DE OLIVEIRA
Published: (2025)
Rethinking Media Users in the Age of AI and Algorithmic Mediation
by: Jung, Jaemin
Published: (2025)
Harmony in dissonance: the impact of AI transparency on consumer acceptance in the music industry
by: Kleinert, Anna-Sophie Maria
Published: (2025)
Personalism in Generative AI deployment: deciding ethically when human creative expression is at stake
by: Fioravante, Rosa
Published: (2025)
Inteligência Artificial para Tradutores: Proposta de Conteúdo Essencial
by: Li Wenyan
Published: (2026)