Skip to content

AI Red / Blue / Purple Team — FAQ

Adapted from classic cybersecurity team-colour terminology and applied to AI/LLM systems. These terms are increasingly common in 2025–2026 hiring conversations, AI safety policy, and regulated-industry quality conversations.


The Team Colours at a Glance

Team Stance Goal Typical Output
Red Team Offensive Break it. Find vulnerabilities, bypass guardrails, elicit unsafe/wrong behaviour Attack reports, exploit corpora, vulnerability inventories
Blue Team Defensive Protect it. Build guardrails, detection, response, monitoring Guardrails, filters, monitoring dashboards, incident playbooks
Purple Team Collaborative Improve the loop. Red + Blue working together; every attack becomes a defence Faster detect→fix cycle; regression tests for attacks
Green Team Engineering Build it right. Engineering team that owns the system being tested Secure-by-design code, fixes for findings
Yellow Team Builders/UX Design it safely. Product, design, and UX with safety-aware mental models Safe defaults, friction at risky moments
White Team Governance Oversee it. Run the exercise; arbitrate scope, scoring, rules of engagement Rules of engagement, scoring, final reports
Orange Team Awareness Yellow + Red — secure design teaching for product/UX Secure-design training, threat-model templates

Red / Blue / Purple are the three you'll be asked about. Green / Yellow / White appear in mature orgs.


Red Team

Q: What is an AI Red Team?

A team (people, agents, or both) whose job is to attack an AI system to discover failure modes before real adversaries do. Borrowed from military and cybersecurity practice. In AI, the targets are LLM applications, agents, RAG pipelines, and MCP tools — and the "vulnerabilities" are unsafe outputs, hallucinations, policy bypasses, data leakage, prompt injection, jailbreaks, biased or harmful content, and tool misuse.

Q: What does an AI Red Team actually do day-to-day?

  • Adversarial prompting — craft inputs designed to bypass system prompts, safety filters, or grounding
  • Jailbreak research — find new techniques (role-play, encoding, multi-turn escalation, hypothetical framing, multilingual attacks)
  • Indirect prompt injection — plant malicious instructions in documents, tool outputs, web pages, or emails the LLM will consume
  • Tool / agent misuse — get an agent to call tools it shouldn't, or with arguments it shouldn't
  • Data exfiltration attempts — extract system prompts, training data hints, PII from logs/memory
  • Policy bypass — get the system to violate stated policy (financial advice, legal advice, medical, illegal content)
  • Bias & harm elicitation — find inputs that surface demographic, political, or toxicity issues
  • Build adversarial corpora — turn every successful attack into a reusable test case

Q: What is a "Red Team agent"?

A specific 2025–2026 development: an LLM-driven agent that generates and runs attacks automatically against a target LLM system. Rather than a human writing each jailbreak by hand, the Red Team agent: 1. Takes a target purpose / system prompt as input 2. Generates adversarial prompts for chosen vulnerability categories 3. Sends them to the target system 4. Scores the responses (LLM-as-judge against compliance criteria) 5. Iterates — successful attacks are mutated into new attacks (genetic-style evolution)

Examples: DeepEval's red-teaming module, Promptfoo's red-team mode, Microsoft PyRIT, NVIDIA Garak, Lakera Red, Mindgard, Patronus. Anthropic, OpenAI, and Google all run internal red-team agents at scale before releases.

Q: Why use agents for red-teaming instead of humans?

  • Scale — millions of attack variants vs hundreds a human can write
  • Coverage — systematic across vulnerability taxonomies (OWASP LLM Top 10, MITRE ATLAS)
  • Continuous — runs in CI on every model/prompt change, not just before launch
  • Reproducibility — same seed and config = comparable runs over time
  • Cost — humans still find novel attack classes; agents amplify by exploring the space

Humans and agents are complementary: humans discover new attack classes, agents explore variations within known classes at scale.

Q: What are common vulnerability categories Red Teams target?

Category Example
Prompt injection — direct "Ignore previous instructions and..."
Prompt injection — indirect Malicious instructions hidden in a retrieved document or tool output
Jailbreak Role-play, hypothetical framing, encoding (Base64, ROT13), multilingual
PII / data leakage Get system prompt, training data fragments, or other users' data
Harmful content Violence, illegal activity, self-harm, CSAM, weapons
Bias / fairness Demographic, political, ideological skew
Misinformation Confident wrong answers, fabricated citations
Tool misuse Agent calls wrong tool, wrong args, or in unauthorised sequence
Authority escalation Agent performs action user shouldn't be permitted to trigger
Resource abuse Infinite loops, runaway costs, denial of wallet
Hallucination at the edge No-answer-available cases where model invents an answer

Q: What frameworks/standards guide AI Red Teaming?

  • OWASP LLM Top 10 — community-maintained list of top vulnerabilities for LLM apps
  • MITRE ATLAS — adversarial threat landscape for AI systems (attack tactics/techniques)
  • NIST AI RMF (Risk Management Framework) — governance + risk framing
  • EU AI Act — required testing/evaluation regime for high-risk AI in the EU
  • UK AISI / US AISI evaluations — government AI safety institute eval methodologies
  • Anthropic / OpenAI / DeepMind public model cards — informal benchmark of what to test

Blue Team

Q: What does an AI Blue Team do?

Builds and operates the defences. Where Red breaks, Blue protects. Blue Team work in AI typically includes:

  • Input guardrails — detect and block adversarial prompts before they reach the model (e.g. Lakera Guard, Rebuff, Llama Guard, NeMo Guardrails)
  • Output guardrails — detect and block unsafe, off-policy, or hallucinatory outputs before they reach the user
  • Indirect-injection defence — sanitise retrieved content, sandbox tool outputs, mark content as "untrusted"
  • System-prompt hardening — make prompts resistant to override; minimise leak surface
  • Policy engines — enforce who can call which tools with which args (RBAC for agents)
  • Monitoring & detection — log every request/response, flag anomalies, alert on attack signatures
  • Incident response — playbooks for when an attack succeeds; kill-switches; rollback procedures
  • Patching loop — every Red finding becomes a guardrail, a filter, or a fine-tune signal

Q: What does a Blue Team toolkit look like?

Layer Tools/Approaches
Input guardrails Lakera Guard, Rebuff, Prompt Guard (Meta), NeMo Guardrails, Azure AI Content Safety
Output guardrails Llama Guard, OpenAI Moderation, Guardrails AI, custom classifiers
Policy / RBAC OPA (Open Policy Agent), gateway policy layers, MCP gateway policy engines
Observability Langfuse, LangSmith, Arize, Helicone — full request/response tracing
Anomaly detection Rate-limiting, drift detection, attack-signature matching
Incident response Kill-switches, model rollback, prompt rollback, dataset rollback

Q: How does Blue Team work differ from traditional security blue team?

Traditional blue team protects networks and endpoints — signatures, IDS, EDR, SIEM. AI blue team protects model behaviour — much fuzzier. There's no antivirus signature for "this is a jailbreak"; defences are probabilistic classifiers, layered controls, and human review. Both share: defence-in-depth, telemetry-driven, runbook-driven incident response.


Purple Team

Q: What is a Purple Team?

The bridge between Red and Blue. Instead of Red attacking and then throwing findings to Blue, Purple is a joint working mode where both sides operate together — Red runs an attack, Blue immediately tries to detect/block it, both observe results, and they iterate in real time. The goal is to shorten the attack → detection → mitigation → regression-test loop from weeks to hours.

Q: What does a Purple exercise look like in practice for AI?

  1. Red runs a novel jailbreak against the LLM application
  2. Blue checks: did our input guardrail catch it? Output guardrail? Monitoring alert?
  3. If anything is missed, Blue builds the missing control while Red watches
  4. The successful attack becomes a permanent regression test in CI
  5. Red mutates the attack; Blue tightens; repeat until coverage is acceptable

Q: When should an organisation run Purple instead of pure Red/Blue?

  • When the attack-to-fix cycle is too slow (siloed teams)
  • When Blue defences are immature and need concrete attacks to motivate work
  • When you want auditable evidence that every known attack has a corresponding defence + regression test (great for regulated environments — PTB/PTO, EU AI Act, FCA, etc.)
  • When eval and security teams are small and can't afford separate workstreams

Q: How does QA / Quality Engineering relate to Purple?

QE is naturally Purple-shaped for AI systems — the QE runs adversarial tests (Red) but also owns the regression suite that codifies defences (Blue). A good QE for an agentic AI platform IS the in-house Purple function in everything but name. This is a great framing for the LSEG-style role: "Quality Engineering on an AI platform is fundamentally a purple-team function — every attack we find becomes a guardrail Blue owns and a regression I own forever."


Other Team Colours (less common but worth knowing)

Green Team

The engineers who build the system being tested. The targets of Red, the partners of Blue, the consumers of Purple's outputs. In AI: ML engineers, prompt engineers, application engineers.

Yellow Team

Builders/designers — product, UX, designers. In AI: prompt engineers, conversation designers, agent-experience designers. Yellow's job is "secure by design" — pick the safe default, add friction at risky moments, surface uncertainty.

White Team

Governance. Defines rules of engagement, scope, scoring, escalation. In AI: AI safety committee, model risk management, compliance, audit. White team writes the PTB/PTO criteria and arbitrates whether evidence meets them.

Orange Team

Yellow + Red. Teaches builders to think like attackers. In AI: secure-design training for prompt engineers; threat-model templates for new agent capabilities.


Common Interview Questions

Q: "How would you structure red-teaming for our agentic AI platform?"

"Three layers. Continuous automated red-teaming in CI — a Red Team agent (DeepEval's, Promptfoo's, or a custom one on PyRIT/Garak) running fixed vulnerability categories on every change, with thresholds gating release. Periodic human red-teaming — a quarterly or per-major-release exercise where humans probe for novel attack classes the agent can't yet generate. Purple working mode — Red and Blue jointly when we're hardening a specific surface, so every successful attack becomes a documented defence plus a permanent regression test. I'd also map findings to OWASP LLM Top 10 and MITRE ATLAS so coverage is auditable for governance reviews."

Q: "What's the difference between red-teaming and penetration testing?"

"Pen-testing is usually scoped to known vulnerabilities and known surfaces — find the bug in this API. Red-teaming is goal-oriented and open-scope — get the system to do something it shouldn't, by whatever path. For LLMs the distinction matters because the surface is the behaviour, not just the API. A pen-test might find an auth bug; a red-team exercise finds that an indirect prompt injection through a retrieved PDF bypasses the entire safety system."

Q: "How do you stop red-team findings from becoming a backlog of unfixed issues?"

"Two things. First, every Red finding gets a regression test the moment it's reproduced — even before the fix lands. That way the issue is known and tracked even if the fix is weeks out. Second, you tier findings by severity and have explicit SLAs — critical safety issues block release, medium issues get fixed within N releases with a documented mitigation in the meantime, low issues are tracked but not gating. Without tiers everything is "P1" and nothing actually moves."

Q: "Who should run AI red-teaming — security, ML, or QA?"

"All three, in different modes. Security owns the threat model and the rules of engagement. ML owns the model-level defences (system prompts, fine-tuning, RLHF data). QA — or quality engineering — owns the continuous red-teaming in CI, the regression suite, and the evidence that defences are working over time. The trap is making it any one team's exclusive job — it has to be a joint capability with a clear governance line through White."


Sound-Bites for Tomorrow's Interview

If the conversation turns to red-teaming or safety testing:

  • "I treat red-teaming as a continuous activity, not a launch-gate event. Every change runs the adversarial suite; every successful attack becomes a permanent regression test."
  • "For an MCP platform the highest-leverage red-team focus is indirect prompt injection — malicious instructions hidden in tool outputs or retrieved content. Direct injection is well-publicised and well-defended; indirect is where most real-world incidents happen."
  • "Quality Engineering on an AI platform is essentially a purple-team function — I run the attacks, I own the regression suite, and I work in close loop with whoever owns the runtime guardrails."
  • "I'd map our coverage to OWASP LLM Top 10 and MITRE ATLAS so the governance team has an auditable view of what we test, not just a metric score."