Red-Blue Visual Auto Defender: Automated Visual Jailbreak Generation and Explainable Defenses

Washington University in St. Louis — Aug 2025 – Dec 2025

This project, built for Washington University's CSE 5519 course, addresses visual prompt injection (VPI) — image-embedded text instructions that can hijack vision-language models used as agent cores, such as an email assistant. Rather than relying on another ML model as a defense, which is itself attackable and hard to interpret, I built an automated red-blue pipeline: it generates attack images by overlaying malicious instructions onto benign images, tests them against a target VLM (Gemma-3-4b-it), and uses semantic analysis of the model's responses to determine whether an attack succeeded.

For each successful attack, the pipeline automatically generates a deterministic, OCR-based Python defense script — auditable, rule-based code rather than another opaque model — and validates it with accuracy, precision, and recall metrics in an iterative refinement loop. The approach was demonstrated on a realistic email-agent attack scenario.

Highlights

Automated generation of visual prompt-injection (jailbreak) attack images
Attack evaluation against Gemma-3-4b-it with semantic response analysis
Auto-generated, deterministic OCR-based Python defenses instead of black-box ML
Iterative red-blue refinement loop with accuracy/precision/recall validation

Technologies Applied

PythonPyTorchTransformersGemma-3PillowOpenCVOCROpenAI APIAnthropic API

GitHub