Why do AI systems need red teaming beyond traditional penetration testing?

Why This Matters for Security Teams

Traditional penetration testing is built to find exploitable weaknesses in infrastructure, code, and configuration. AI systems fail differently. They can be steered by prompt injection, poisoned retrieval content, unsafe tool outputs, or hidden instructions that only appear at runtime. That means a system may pass a conventional security review and still behave unsafely when it encounters a hostile prompt or a deceptive data source. The NIST Cybersecurity Framework 2.0 is useful here because it frames risk management around outcomes, not just technical hardening.

This is why ai red teaming exists as a separate discipline: it stress tests model behaviour, tool use, and trust boundaries under adversarial conditions. NHIMG research on the DeepSeek breach shows how AI-related exposure can extend well beyond classic application flaws and into compromised content, credentials, and data handling. The security question is no longer only whether a system can be hacked, but whether it can be manipulated into unsafe action. In practice, many security teams discover these failures only after an agent has already accepted a malicious instruction or routed a sensitive action through the wrong tool.

How It Works in Practice

AI red teaming exercises the full decision path of a model or agent: user input, retrieved context, memory, orchestration layer, and tool execution. The goal is to find where trust is assumed but not enforced. Current guidance suggests testing not just the model prompt, but also the application logic that surrounds it, because the dangerous behaviour often emerges from the combination of model output and downstream automation. That is especially true for AI agents that can call APIs, search internal systems, or trigger workflows.

Effective red teaming usually includes several test classes:

Prompt injection against chat, retrieval, and agent workflows

Data poisoning or malicious context in knowledge bases and documents

Tool misuse, including over-broad permissions and unsafe action routing

Jailbreak attempts that cause policy bypass or unwanted disclosure

Cross-session and multi-step attacks that only succeed after chained interactions

That work aligns with the NIST Cybersecurity Framework 2.0 because the control objective is resilience, not only prevention. It also fits the threat patterns discussed in LLMjacking: How Attackers Hijack AI Using Compromised NHIs, where compromised identities and exposed secrets become the bridge from model misuse to real operational damage. For governance depth, teams should pair red teaming with The State of Secrets in AppSec because credential exposure and weak secrets hygiene often amplify AI attack impact. These controls tend to break down when the model has persistent memory, broad tool access, and production data connectivity because adversarial behaviour can compound across multiple turns.

Common Variations and Edge Cases

Tighter red teaming coverage often increases cost, operational friction, and false positive handling, so organisations must balance coverage against release speed. There is no universal standard for this yet, but best practice is evolving toward risk-based scenarios that mirror the actual deployment model rather than generic jailbreak scripts. A customer-facing chatbot, a private RAG assistant, and an autonomous workflow agent do not need the same test plan.

One common edge case is that traditional red teaming focuses on model refusal behaviour, while production risk often sits in the surrounding control plane. If the system can send email, approve transactions, modify records, or invoke internal APIs, the real test is whether the agent can be tricked into taking an action that exceeds user intent. Another edge case is retrieval-augmented generation: if malicious content is indexed as trusted context, the model may appear to be “working correctly” while still producing unsafe output. That is why current guidance suggests combining behavioural testing with permission review, sandboxing, and runtime policy checks. The NIST Cybersecurity Framework 2.0 and the NHIMG material on DeepSeek breach both point to the same operational reality: AI systems fail most dangerously where trust is implicit and automation is immediate.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A04	Red teaming targets prompt injection and unsafe agent actions.
CSA MAESTRO	A3	MAESTRO addresses agent threat modeling and adversarial testing.
NIST AI RMF	GOVERN	AI RMF governs risk oversight for behavioural AI failures.

Test agent workflows for prompt injection, tool abuse, and policy bypass before release.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

Why do AI systems need red teaming beyond traditional penetration testing?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group