Publication

Multi-Turn LLM Red Teaming via Token Steering and RL-Trained Policy

Bibliographic details
Abstract:	Large Language Models (LLMs) excel at natural language processing but remain vulnerable to multi-turn adversarial attacks that exploit contextual dependencies. Traditional red-teaming focuses on single-prompt attacks, missing nuanced, persistent vulnerabilities. Existing defenses struggle to counteract these evolving multi-turn threats, while human-driven red-teaming is resource-intensive and inconsistent. Previous research has explored single-turn jailbreak attacks, adversarial prompting techniques, and defensive strategies such as reinforcement learning from human feedback (RLHF) and fine-tuning. However, these approaches remain limited in addressing dynamic, multi-turn attack strategies. Recent work on automated adversarial testing has highlighted the effectiveness of iterative attack planning, yet a scalable, RL-driven approach remains unexplored. This thesis introduces Red-DAT, a reinforcement learning (RL)-based framework for automated multi-turn adversarial attacks. The approach leverages jailbreak detection feedback to refine attack strategies, combining prefix tuning with reinforcement learning to steer LLMs in multi-turn adversarial dialogues. The framework is integrated into RedTWIZ, a larger red-teaming system developed for the Amazon Trusted AI Challenge, where it was evaluated in competitive tournament settings against state-of-the-art, safety-tuned LLMs. Experiments show that Red-DAT improves attack success and adapts to different defender models, revealing weaknesses in current safety mechanisms. Integrated into RedTWIZ, it contributed to the team’s second-place finish in the international competition. This work advances multi-turn red-teaming as a practical approach to AI safety by systematically uncovering vulnerabilities in LLMs and informing the design of stronger defenses.
Main authors:	Paz, Henrique Mestre dos Santos Sacadura
Subject:	Large Language Models; Multi-Turn Attacks; Reinforcement Learning; Adversarial AI; Red-Teaming; AI Safety
Year:	2025
Country:	Portugal
Document type:	master's dissertation
Access type:	open access
Associated institution:	Universidade Nova de Lisboa
Language:	English
Source:	Repositório Institucional da UNL

Related records

RedTreez: A Tree-Based Framework for Conversational Red Teaming
by: Paulo, Iago Miguel do Nascimento
Published: (2025)

RedTWIZ Guard: Adversarial Jailbreak Detection in Multi-Turn Conversations
by: Pina, Daniel Lopes
Published: (2025)

Adaptive Attack Planning for Red-Teaming Large Language Models
by: Horal, Artur
Published: (2025)

An LLM-Based Conversational Agent for Multi-Operator Public Transport Information
by: Berthele, Louis Jacob
Published: (2026)

Operationalising Intelligent Augmentation: A multi-layer analysis of LLM use in organizations
by: Andrade, Miguel Filipe Jacinto
Published: (2026)

Cyber Red team bot with RL
by: Neto, João Pires Ferreira
Published: (2025)

Defending the defender: adversarial learning based defending strategy for learning based security methods in Cyber-Physical Systems (CPS)
by: Sheikh, Zakir Ahmad
Published: (2023)

Tourists and artificial intelligence-LLM interaction: The power of forgiveness
by: Loureiro, S. M. C.
Published: (2025)

Adaptative Perturbation Patterns: Realistic Adversarial Learning for Robust Intrusion Detection
by: Vitorino, João
Published: (2022)

Navegação turn-by-turn em Android
by: Henriques, Luís Miguel dos Santos
Published: (2019)

GENERATIVE AI MUTABILITY IN CYBERSECURITY: A BIBLIOMETRIC REVIEW
by: Oliveira, Pedro
Published: (2026)

Constrained adversarial learning for automated software testing: a literature review
by: Vitorino, João
Published: (2025)

Generative AI for growth hacking: how startups use generative AI in their growth strategies
by: Rezazadeh, Arash
Published: (2025)

Attack Detection in Cyber-Physical Production Systems using the Deterministic Dendritic Cell Algorithm
by: Pinto, Rui
Published: (2020)

Towards Adversarial Realism and Robust Learning for IoT Intrusion Detection and Classification
by: Vitorino, João
Published: (2023)

Do we need to trust our ai-teammate to perform? Investigating the mediating role of team trust in the relationship between team composition and team performance
by: Friedrich, Lilian Marie
Published: (2023)

Generative AI for growth hacking
by: Rezazadeh, Arash
Published: (2025)

Deep Learning Based Communication: an Adversarial Approach
by: Emami, Yousef
Published: (2019)

Pivotal or peripheral: assessing the role of generative artificial intelligence in accelerating entrepreneurial success - a study of enhancing software engineering
by: Bakakis, Andreas
Published: (2024)

Simplifying complex insurance product management with AI
by: Teixeira, João
Published: (2025)

Shaping the future of technical education in healthcare: development of a new biomed certification program. What customers need and beyond
by: Pilz, Nils Jonathan
Published: (2024)

Adversarial Attacks to Classification Systems
by: Leal, João Miguel Gouveia
Published: (2022)

A Genetic Algorithm Framework for Jailbreaking Large Language Models [poster]
by: Bonin, Lorenzo
Published: (2025)

Motion planning using DeepRL, applied to an industrial forklift pallet picking problem
by: Godinho, Francisco José Pessoa
Published: (2025)

Navigating the Risks and Safeguards of Large Language Models (LLMs): Addressing Data Privacy, Security, and Ethical Concerns
by: Jarząbkowski, Mikołaj
Published: (2025)

Constrained Adversarial Learning and its applicability to Automated Software Testing: a systematic review
by: Vitorino, João
Published: (2023)

Artificial intelligence in recruitment: a multivocal review of benefits, challenges, and strategies
by: Trovão, Hugo
Published: (2025)

Implications of causality in artificial intelligence
by: Cavique, Luís
Published: (2024)

Navigating the AI-powered workplace: generative AI and generation Z in service organizations-What are the challenges and opportunities of enabling collaboration between generation Z employees and generative AI tools in service organizations?
by: Beumer, Jan-Sina
Published: (2026)

GENERATIVE AI FOR ENTERPRISE ARCHITECTURE SUPPORT: A CHATBOT SOLUTION FOR IT QUERIES
by: Anjo, Lucas Matias Neto
Published: (2025)

Creativity support tool for sustainability: an LLM-Assisted Creative Process
by: Rodrigues, Ana Isabel Mendonça
Published: (2026)

LLM-based cost-aware task scheduling for cloud computing systems
by: Pei, Haoran
Published: (2025)

Exploring Trust and Literacy in Engagement With Generative AI and Science Information Behavior
by: Agergaard, Torben E.
Published: (2026)

Online dating solutions and turning points motivations in Portugal
by: Sepúlveda, Rita
Published: (2020)

Analyzing the significance of AI in digital marketing teams : a case study on Gocomo and the AI assistant Alfred
by: Ivankovic, Antonela
Published: (2024)

CodeAssert: Multi-Provider LLM Evaluation for Automated Unit Test Generation
by: GONÇALVES, JORGE ANDRÉ DE OLIVEIRA
Published: (2025)

Rethinking Media Users in the Age of AI and Algorithmic Mediation
by: Jung, Jaemin
Published: (2025)

Harmony in dissonance : the impact of AI transparency on consumer acceptance in the music industry
by: Kleinert, Anna-Sophie Maria
Published: (2025)

Personalism in Generative AI deployment: deciding ethically when human creative expression is at stake
by: Fioravante, Rosa
Published: (2025)

Inteligência Artificial para Tradutores: Proposta de Conteúdo Essencial
by: Li Wenyan
Published: (2026)