[Submitted on 24 Jan 2026 (v1), last revised 4 Mar 2026 (this version, v3)]
Dynamic Adversarial Reinforcement Learning for Robust Multimodal Large Language Models, by Yicheng Bao and 5 other authors
Abstract: Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) exhibit perceptual fragility when confronted with visually complex scenes. This weakness stems from a reliance on finite training datasets, which are prohibitively expensive to scale and impose a ceiling on model robustness. We introduce AOT-SFT, a large-scale adversarial dataset for bootstrapping MLLM robustness. Building on this, we propose AOT (Adversarial Opponent Training), a self-play framework that forges MLLM robustness by creating its own training data. Our method orchestrates a co-evolution between an image-editing Attacker and a Defender MLLM, in which the Attacker generates a diverse and dynamic curriculum of image manipulations, forcing the Defender to adapt and improve. Extensive experiments demonstrate that AOT enhances the Defender's perceptual robustness and reduces hallucinations, establishing a scalable paradigm for training more reliable MLLMs.
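The Attacker–Defender co-evolution described in the abstract can be sketched as a toy self-play loop. Everything below is an illustrative assumption, not the paper's implementation: the function names, the mean-intensity "perception" stand-in, and the multiplicative escalation rule are all invented for demonstration; the actual method trains an image-editing model and an MLLM via reinforcement learning.

```python
import random

def attacker_edit(image, strength, rng):
    """Attacker (hypothetical): apply a random perturbation of the given strength."""
    return [px + rng.uniform(-strength, strength) for px in image]

def defender_predict(image, robust_margin):
    """Defender (hypothetical): 'perceives' the mean intensity; its answer is
    robust if the edited mean stays within its learned margin of the clean
    mean (0.5 in this toy setup)."""
    mean = sum(image) / len(image)
    return abs(mean - 0.5) <= robust_margin

def self_play(rounds=200, seed=0):
    """Toy co-evolution: the Attacker escalates its curriculum whenever the
    Defender withstands an edit; the Defender widens its robustness whenever
    it is fooled."""
    rng = random.Random(seed)
    strength, margin = 0.05, 0.01   # initial attack strength / defender robustness
    clean = [0.5] * 16              # stand-in for a clean image
    for _ in range(rounds):
        edited = attacker_edit(clean, strength, rng)
        if defender_predict(edited, margin):
            strength *= 1.05        # attack survived: make the curriculum harder
        else:
            margin *= 1.05          # defender fooled: adapt and improve
    return strength, margin

strength, margin = self_play()
```

After enough rounds both quantities grow from their initial values, mirroring the claimed dynamic: a harder curriculum on one side drives greater robustness on the other.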
Submission history
From: Yicheng Bao
[v1] Sat, 24 Jan 2026 03:47:29 UTC (14,820 KB)
[v2] Fri, 27 Feb 2026 05:20:42 UTC (14,820 KB)
[v3] Wed, 4 Mar 2026 11:04:46 UTC (14,820 KB)