Red-Blue Visual Auto Defender: Automated Visual Jailbreak Generation and Explainable Defenses
Washington University in St. Louis — Fall 2025
This project, built for Washington University's CSE 5519 course with teammates Stuart Aldrich and Mohammad Rouie Miab, addresses visual prompt injection (VPI) — image-embedded text instructions that can hijack vision-language models used as agent cores, such as an email assistant. Rather than relying on another ML model as a defense, which is itself attackable and hard to interpret, the team built an automated red-blue pipeline: it generates attack images by overlaying malicious instructions onto benign images, tests them against a target VLM (Gemma-3-4b-it), and uses semantic analysis of the model's responses to determine whether an attack succeeded.
For each successful attack, the pipeline automatically generates a deterministic, OCR-based Python defense script — auditable, rule-based code rather than another opaque model — and validates it with accuracy, precision, and recall metrics in an iterative refinement loop. The approach was demonstrated on a realistic email-agent attack scenario.
Highlights
- Automated generation of visual prompt-injection (jailbreak) attack images
- Attack evaluation against Gemma-3-4b-it with semantic response analysis
- Auto-generated, deterministic OCR-based Python defenses instead of black-box ML
- Iterative red-blue refinement loop with accuracy/precision/recall validation