How should security teams run AI red teaming for GenAI systems?

Why This Matters for Security Teams

ai red teaming for GenAI systems is not a prompt-only exercise. The real risk sits at the seams between the model, retrieval layers, tool execution, and output channels, where a model can be induced to expose secrets, follow malicious instructions, or trigger unsafe actions. NHI Management Group’s research on the State of Secrets in AppSec shows 43% of security professionals are already concerned about AI systems learning and reproducing sensitive information patterns from codebases, which is exactly why red teams must test for leakage, not just harmful text.

Current guidance also points to structured evaluations rather than ad hoc jailbreak attempts. The NIST AI 600-1 GenAI Profile is useful here because it frames GenAI risk as a governance and operational issue, not a single-model defect. Red teams should therefore simulate realistic attacker paths through retrieval poisoning, tool misuse, data exfiltration, and unsafe action chaining. In practice, many security teams encounter the highest-risk failure modes only after the system is already connected to production data and external tools, rather than through intentional test design.

How It Works in Practice

Effective GenAI red teaming starts with a threat model that defines what the system can read, retrieve, transform, and execute. That means testing the whole workflow, not just the chat interface. A prompt injection that is harmless in isolation can become serious once the model can call ticketing systems, email, databases, or code execution tools. The test plan should explicitly cover direct prompts, indirect prompts embedded in retrieved documents, malicious tool outputs, and post-generation handling such as auto-posting or auto-approval.

A practical red team program usually includes three layers:

Model behavior tests: jailbreaks, policy bypass attempts, instruction hierarchy conflicts, and system prompt leakage attempts.

Retrieval and data tests: poisoning vector stores, surfacing sensitive documents, and forcing the model to quote protected content.

Action tests: tool abuse, privilege escalation through chained calls, unsafe file writes, and workflow manipulation.

Teams should measure outcomes in operational terms: did the system reveal secrets, over-disclose context, execute an action outside intent, or persist unsafe state? This is where NHI and secrets governance matter. If the GenAI stack relies on long-lived API keys or broad service accounts, red teaming should verify whether those credentials can be abused through the model path. The State of Non-Human Identity Security highlights how weak visibility and over-privileged accounts remain common attack conditions, which makes GenAI-connected NHIs especially worth testing.

For execution quality, teams often align scenarios with the kinds of adversarial paths discussed in the Anthropic Frontier Red Team technical analysis, then adapt them to the organization’s own data, tools, and policy boundaries. These controls tend to break down when agents or copilots have broad tool authority and weak output gating because the model can turn a language exploit into a real-world action.

Common Variations and Edge Cases

Tighter red teaming often increases operational overhead, requiring organisations to balance realistic adversarial testing against development speed and stakeholder tolerance. That tradeoff becomes more acute when systems are used for customer support, engineering assistance, or internal automation, because the same workflow may have different risk thresholds depending on the data and action scope.

There is no universal standard for GenAI red teaming depth yet, so guidance should be matched to exposure. For a model that only drafts text, the focus may stay on prompt leakage and harmful output filtering. For a system that can retrieve records or trigger actions, testing should expand to indirect prompt injection, privilege misuse, and downstream control failures. Best practice is evolving toward scenario-based testing with clear pass and fail criteria, not one-off “break the bot” demos.

Edge cases matter. Systems that separate the model from the action layer can still fail if outputs are trusted too early, and systems with strong model guardrails can still leak through retrieval or logs. Red teams should also test multilingual prompts, malformed inputs, and hidden instructions in documents, since adversaries rarely rely on a single attack style. When the environment includes vendor plugins, external APIs, or weak identity boundaries between services, the test surface becomes much larger than the chat experience itself.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	LLM-01	Prompt injection and unsafe output handling are central to GenAI red teaming.
CSA MAESTRO	M2	Covers agentic workflow abuse across prompts, tools, and actions.
NIST AI RMF		AI RMF supports structured risk identification and evaluation for GenAI systems.

Test model, retrieval, and tool boundaries together, then fix the highest-risk injection paths.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

How should security teams run AI red teaming for GenAI systems?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group