Who is accountable for model outputs that leak unsafe guidance through iterative probing?

Accountability sits with the organisation operating the model, because the underlying failure is one of governance: how reasoning visibility, logging, and runtime guardrails are configured. Security, AI platform, and policy owners need shared responsibility for the control plane, not post-incident blame-shifting.

Why This Matters for Security Teams

Iterative probing changes the accountability question because the harmful output is not a single model response; it is the result of an operational control failure across prompts, memory, tool access, logging, and human oversight. When unsafe guidance emerges only after repeated attempts, teams must examine the runtime governance design, not only the model card. Current guidance suggests that organisations operating agentic or prompt-driven systems should treat this as a control-plane issue, similar to the lifecycle governance of non-human identities (NHIs) described in the Ultimate Guide to NHIs — Why NHI Security Matters Now.

The practical risk is that blame often lands on the last person to review an output, while the real failure sits in policy configuration, guardrails, and escalation paths. That matters because unsafe guidance can be amplified through chained prompts, retrieval, or tool calls before anyone notices. NHI Management Group research shows that 79% of organisations have experienced secrets leaks, and 77% of those incidents caused tangible damage, which is a useful reminder that repeated exposure often becomes a business incident before it becomes an engineering ticket. In practice, many security teams encounter the accountability gap only after repeated probing has already produced a usable exploit path, rather than through intentional testing.

How It Works in Practice

Accountability should be assigned to the organisation operating the system, then broken down into shared responsibilities across security, AI platform, policy, and product owners. The operator is responsible for defining acceptable use, deploying guardrails, configuring escalation, and proving that unsafe outputs are monitored and contained. The model provider may contribute baseline safety features, but that does not transfer operational accountability for deployment choices.

For iterative probing specifically, the question is whether the system is resilient to repeated attempts, context accumulation, and jailbreak chaining. That means evaluating both prompt-time and session-time controls. The most effective patterns are runtime controls, not static policy statements. Examples include:

Policy-as-code checks that evaluate each request in context, rather than relying on a one-time approval.
Logging that preserves prompts, tool calls, retrieval results, and safety-filter decisions for review.
Guardrails that limit repeated probing, session escalation, and tool invocation after risky patterns emerge.
Role assignment that makes one team accountable for model risk decisions and another for operational enforcement.
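The runtime controls listed above can be combined into a small session-aware gate. This is a hypothetical sketch, not any vendor's API: `SessionGuard`, `classify_risk`, the keyword markers, and the three-flag threshold are all assumptions, and the keyword check stands in for a real safety classifier. It evaluates each request in session context, restricts tool calls once risky patterns appear, escalates after repeated flags, and logs every decision for review.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical risk markers; a production system would call a real safety classifier.
RISKY_MARKERS = ("bypass", "ignore previous", "disable the filter")


def classify_risk(prompt: str) -> bool:
    """Toy per-request policy check: flag prompts containing a risky marker."""
    text = prompt.lower()
    return any(marker in text for marker in RISKY_MARKERS)


@dataclass
class SessionGuard:
    """Evaluates each request in the context of its whole session (hypothetical)."""

    max_flags: int = 3  # assumed threshold before the session is escalated
    flags: int = 0
    locked: bool = False
    audit_log: list = field(default_factory=list)

    def evaluate(self, prompt: str, requested_tool: Optional[str] = None) -> str:
        flagged = classify_risk(prompt)
        if flagged:
            self.flags += 1
        if self.flags >= self.max_flags:
            self.locked = True  # stays locked for the rest of the session

        if self.locked:
            decision = "escalate"   # route to human review; no model or tool output
        elif flagged:
            decision = "block"      # refuse this request
        elif requested_tool and self.flags > 0:
            decision = "deny_tool"  # earlier risky turns: no tool calls this session
        else:
            decision = "allow"

        # Preserve the prompt, tool request, and safety decision for later review.
        self.audit_log.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "prompt": prompt,
            "tool": requested_tool,
            "flagged": flagged,
            "decision": decision,
        })
        return decision
```

The design choice that matters for iterative probing is that the lock is sticky: rephrasing after escalation does not reset the counter, so accumulated session context counts against the requester rather than being evaluated one prompt at a time.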

For AI systems that behave autonomously, this increasingly aligns with the NIST AI Risk Management Framework and with agent-focused security work from OWASP and the Cloud Security Alliance, where runtime governance and misuse resistance matter more than static policy language. The right accountability model also mirrors lessons from the 52 NHI Breaches Analysis, where weak control design, not just credential theft, drives the incident outcome. These controls tend to break down when model access is embedded in high-volume support workflows, because repeated retries obscure the difference between normal use and active probing.
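One way to separate ordinary retries from active probing is to count how many distinct prompts were refused in a session, collapsing trivial resubmissions of the same request. This is a hypothetical heuristic, not an established detector: it assumes the system records whether each request was refused, and the `min_variants` threshold of 3 is an arbitrary illustration. Normalised fingerprints only catch exact or near-exact resubmissions; paraphrased benign requests would still count as separate variants.

```python
import hashlib


def normalise(prompt: str) -> str:
    """Lower-case and collapse whitespace so trivial resubmissions match."""
    return " ".join(prompt.lower().split())


def fingerprint(prompt: str) -> str:
    """Stable hash of the normalised prompt."""
    return hashlib.sha256(normalise(prompt).encode("utf-8")).hexdigest()


def looks_like_probing(events: list[tuple[str, bool]], min_variants: int = 3) -> bool:
    """Flag a session as probing when several *distinct* prompts were refused.

    events: (prompt, was_refused) pairs for one session. Identical retries
    collapse to one fingerprint, so a user resubmitting the same request
    after a timeout is not counted as probing.
    """
    refused_variants = {fingerprint(p) for p, refused in events if refused}
    return len(refused_variants) >= min_variants
```

In a high-volume support queue, five refusals of the same request look like a frustrated user; three refusals of differently worded requests look like someone searching for a phrasing that gets through.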

Common Variations and Edge Cases

Tighter runtime control often increases operational friction, requiring organisations to balance safety against latency, user experience, and false positives. There is no universal standard for this yet, so accountability models should be explicit about where human review is mandatory and where automation is allowed.
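Making the human-review boundary explicit can be as simple as a policy table that names which output categories may be released automatically. The category names and the `REVIEW_POLICY` table below are hypothetical placeholders; the point of the sketch is that anything the policy does not name defaults to mandatory human review.

```python
from enum import Enum


class Review(Enum):
    AUTOMATED = "automated"
    HUMAN_REQUIRED = "human_required"


# Hypothetical policy table: which output categories may be released automatically.
REVIEW_POLICY = {
    "general_support": Review.AUTOMATED,
    "account_changes": Review.AUTOMATED,
    "security_guidance": Review.HUMAN_REQUIRED,
    "flagged_session": Review.HUMAN_REQUIRED,
}


def review_mode(category: str) -> Review:
    """Fail closed: any category the policy does not name requires human review."""
    return REVIEW_POLICY.get(category, Review.HUMAN_REQUIRED)
```

The fail-closed default accepts more friction in exchange for safety; an organisation that prefers the opposite trade-off should record that as an explicit, owned risk decision rather than leave it implicit.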

One edge case is vendor-hosted models with configurable safety filters. Even then, the operator remains accountable for the deployment context, data exposure, and downstream harm. Another is multi-agent systems, where one agent can probe, another can retrieve, and a third can execute a tool call. In those environments, liability becomes harder to assign unless ownership of each control boundary is documented.
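For multi-agent systems, documented ownership of each control boundary can be checked before deployment. A minimal sketch, assuming a hypothetical `ControlBoundary` record per agent; the agent names, controls, and team names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ControlBoundary:
    agent: str             # which agent sits at this boundary (e.g. retriever)
    control: str           # the guardrail enforced there
    owner: Optional[str]   # accountable team; None when undocumented


def unowned_boundaries(boundaries: list[ControlBoundary]) -> list[ControlBoundary]:
    """Return every boundary whose control has no documented owner."""
    return [b for b in boundaries if not b.owner]


if __name__ == "__main__":
    pipeline = [
        ControlBoundary("planner", "prompt filtering", "ai-platform"),
        ControlBoundary("retriever", "retrieval allow-list", "security"),
        ControlBoundary("executor", "tool-call approval", None),
    ]
    for gap in unowned_boundaries(pipeline):
        print(f"No owner for {gap.control} at {gap.agent}")
```

Run as a pre-deployment gate, a non-empty result blocks release until someone accepts ownership of the gap, which is exactly the boundary where accountability otherwise becomes hard to assign after an incident.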

Security teams should also distinguish between accidental unsafe guidance and prompt-injection-driven misuse. The remediation path is similar, but the governance answer differs. Current practice suggests that organisations should assign one accountable owner for model risk acceptance, while control owners manage guardrails, monitoring, and incident response. That is especially important when systems are allowed to refine outputs through iterative prompting, because the harmful result may only appear after several seemingly benign exchanges.
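The split between a single owner for model risk acceptance and separate owners for guardrails, monitoring, and incident response is straightforward to validate mechanically. The `AccountabilityRecord` structure below is a hypothetical schema, not a standard; it simply reports when there is not exactly one risk-acceptance owner or when a named control area lacks an owner.

```python
from dataclasses import dataclass, field

# Control areas named in the guidance; each needs its own owner.
REQUIRED_CONTROL_AREAS = ("guardrails", "monitoring", "incident_response")


@dataclass
class AccountabilityRecord:
    risk_acceptance_owners: list[str]  # should contain exactly one name
    control_owners: dict[str, str] = field(default_factory=dict)

    def problems(self) -> list[str]:
        """List gaps: not exactly one risk owner, or a control area with no owner."""
        issues = []
        if len(self.risk_acceptance_owners) != 1:
            issues.append(
                f"expected one risk-acceptance owner, found {len(self.risk_acceptance_owners)}"
            )
        for area in REQUIRED_CONTROL_AREAS:
            if not self.control_owners.get(area):
                issues.append(f"no owner for control area: {area}")
        return issues
```

Two names in `risk_acceptance_owners` is treated as a problem on purpose: shared risk acceptance tends to become nobody's decision once unsafe guidance has surfaced.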

In short, the accountable party is the operator, but the evidence of control failure should be spread across platform, security, and policy ownership so the same gap is not repeated in the next deployment.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while the NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A03	Iterative probing and unsafe output leakage map to runtime abuse and prompt-injection risk.
CSA MAESTRO	M1	MAESTRO addresses governance for agent behaviour, oversight, and safety controls.
NIST AI RMF	GOVERN, MEASURE	AI RMF is relevant to assigning responsibility for model risk and harmful output controls.

Use AI RMF GOVERN and MEASURE practices to document ownership, testing, and mitigation of unsafe outputs.
