How can organisations reduce unsafe AI outputs without over-restricting users?

Why This Matters for Security Teams

Unsafe AI output is rarely just a prompt-quality problem. It usually reflects weak guardrails across the full request path: system instructions, retrieval sources, policy checks, and post-generation validation. For organisations that need useful AI at scale, the goal is not to block creativity, but to constrain harmful, misleading, or non-compliant output without turning every interaction into a denial of service. That balance is especially important when users rely on AI for drafting, summarisation, decision support, or customer-facing content.

Current guidance suggests treating this as a layered control issue, not a single filter problem. The NIST Cybersecurity Framework 2.0 is useful here because it spreads controls across six functions (Govern, Identify, Protect, Detect, Respond, and Recover) rather than relying on one preventive measure. NHIMG research on the DeepSeek breach shows how quickly exposed data and weak controls can create downstream risk once AI systems are operating on sensitive material. In practice, many security teams discover the problem only after users have already learned how to prompt around policy gaps.

How It Works in Practice

The most effective pattern is to apply policy at three points: before the prompt is processed, while the model is generating, and before the output is released. Approved prompt templates work at the first point, giving users a sanctioned way to ask for common tasks without teaching them to bypass controls. Safe system instructions work at the second, setting boundaries on tone, scope, and prohibited content. Runtime validation works at the third, checking the model’s response against policy rules, domain constraints, and allowed data sources.
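The three checkpoints can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the blocked terms, the system instruction text, and the names guarded_generate and fake_model are all hypothetical, and a simple term match stands in for whatever pre-prompt and pre-release checks an organisation actually runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical policy values, chosen only for illustration.
BLOCKED_REQUEST_TERMS = ("disable audit logging", "bypass approval")
BLOCKED_OUTPUT_TERMS = ("internal use only",)
SYSTEM_INSTRUCTIONS = (
    "Answer only questions about HR policy. "
    "Refuse requests for legal or medical advice."
)


@dataclass
class Result:
    released: bool
    text: str   # the answer if released, otherwise the refusal reason
    stage: str  # which checkpoint made the final decision


def check_request(prompt: str) -> Optional[str]:
    """Checkpoint 1: return a reason if the prompt should be refused."""
    lowered = prompt.lower()
    for term in BLOCKED_REQUEST_TERMS:
        if term in lowered:
            return f"request matches blocked term: {term}"
    return None


def check_output(text: str) -> Optional[str]:
    """Checkpoint 3: return a reason if the draft must be withheld."""
    lowered = text.lower()
    for term in BLOCKED_OUTPUT_TERMS:
        if term in lowered:
            return f"output matches blocked term: {term}"
    return None


def guarded_generate(prompt: str, model: Callable[[str, str], str]) -> Result:
    reason = check_request(prompt)
    if reason:
        return Result(False, reason, "pre-prompt")
    # Checkpoint 2: system instructions travel with every model call.
    draft = model(SYSTEM_INSTRUCTIONS, prompt)
    reason = check_output(draft)
    if reason:
        return Result(False, reason, "pre-release")
    return Result(True, draft, "released")


def fake_model(system: str, prompt: str) -> str:
    """Stand-in for a real model client, so the sketch runs offline."""
    return f"Draft answer to: {prompt}"
```

Recording the stage that made the decision is a deliberate choice: it shows which layer is doing most of the blocking, which helps when tuning controls that feel too restrictive.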

For organisations handling sensitive data, the prompt layer should also be paired with access governance and output monitoring. That means tying AI usage to role-based access, logging who requested what, and validating whether the response references confidential material, regulated advice, or disallowed actions. The point is not to eliminate flexibility, but to make the safe path the easiest path.
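As a rough sketch of tying AI usage to roles and logging who requested what, the snippet below checks a task against a role mapping and emits one structured log record per request. The role names, task names, and the authorise_and_log function are assumptions made for illustration; a real deployment would draw roles from its identity provider.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("ai_usage")

# Hypothetical role-to-task mapping.
ROLE_ALLOWED_TASKS = {
    "support_agent": {"summary", "email_draft"},
    "analyst": {"summary", "email_draft", "knowledge_lookup"},
}


def authorise_and_log(user_id: str, role: str, task: str) -> dict:
    """Decide whether the role may run the task and log the decision."""
    # Unknown roles get an empty task set, so the check fails closed.
    allowed = task in ROLE_ALLOWED_TASKS.get(role, set())
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "role": role,
        "task": task,
        "allowed": allowed,
    }
    # One JSON line per request keeps the log easy to search later.
    logger.info(json.dumps(record))
    return record
```

Denied requests are logged as well as allowed ones, since repeated denials for the same user are often the first visible sign of someone probing for a policy gap.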

Use clear system instructions that define prohibited output, escalation rules, and domain limits.

Apply runtime filters for toxicity, leakage, policy violations, and unsupported claims.

Publish approved templates for recurring use cases such as summaries, email drafts, and knowledge lookup.

Log prompt, retrieval, and output events so security teams can detect policy evasion patterns.

Review exceptions by user role and data sensitivity rather than applying one blanket restriction.
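A runtime leakage filter from the list above might look like the following sketch. The two patterns are illustrative assumptions only: one matches the common "AKIA" prefix shape of AWS access key IDs, the other is a deliberately simple email matcher. Production filters usually need a broader, tested pattern set, and the names scan_output and redact_output are hypothetical.

```python
import re

# Illustrative patterns only; real deployments need a wider, tested set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email_address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def scan_output(text: str) -> list:
    """Return the names of every pattern found in the model output."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]


def redact_output(text: str) -> str:
    """Replace each match with a labelled placeholder instead of blocking."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text
```

Redacting rather than refusing is one way to limit friction: the user still receives a usable draft, and the finding names from scan_output can be written to the event log for review.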

The NIST Cybersecurity Framework 2.0 supports this layered approach because it encourages measurable controls and continuous improvement. NHIMG’s DeepSeek breach coverage is a reminder that AI systems can expose more than intended when retrieval, data handling, and permissions are not aligned. These controls tend to break down when teams allow broad retrieval over sensitive repositories, because the model can faithfully amplify bad inputs even when the prompt itself looks harmless.
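One way to keep retrieval aligned with permissions is to filter retrieved documents against the requester's clearance before they reach the model. The sketch below assumes a three-level sensitivity scheme and a role-to-clearance mapping; both are hypothetical, as are the Document type and the filter_retrieved function.

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels and role clearances.
SENSITIVITY_ORDER = {"public": 0, "internal": 1, "confidential": 2}
ROLE_CLEARANCE = {
    "contractor": "public",
    "employee": "internal",
    "legal": "confidential",
}
MOST_SENSITIVE = max(SENSITIVITY_ORDER.values())


@dataclass(frozen=True)
class Document:
    doc_id: str
    sensitivity: str
    text: str


def filter_retrieved(docs: list, role: str) -> list:
    """Drop any document above the role's clearance before prompting."""
    # Unknown roles fall back to the lowest clearance.
    ceiling = SENSITIVITY_ORDER[ROLE_CLEARANCE.get(role, "public")]
    return [
        d for d in docs
        # Unlabelled or mislabelled documents count as most sensitive.
        if SENSITIVITY_ORDER.get(d.sensitivity, MOST_SENSITIVE) <= ceiling
    ]
```

Both fallbacks fail closed, so a missing label or an unrecognised role narrows what the model can see rather than widening it.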

Common Variations and Edge Cases

Tighter output controls often increase friction, requiring organisations to balance safety against productivity and user trust. That tradeoff is real, especially in environments where staff need fast draft generation or where regulated teams must preserve an audit trail for every answer.

Best practice is evolving, and there is no universal standard for this yet. Some organisations rely heavily on pre-approved templates, while others prefer dynamic policy checks that adapt to the user’s role, the content category, and the destination system. The right choice depends on whether the biggest risk is unsafe public-facing language, sensitive data leakage, or policy drift across many teams. A finance or healthcare environment may need stricter review on generated advice, while a software engineering team may prioritise code-safety checks and secrets leakage prevention.
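A dynamic policy check of this kind can be sketched as a small scoring function over role, content category, and destination. Everything named here is an assumption for illustration: the categories, destinations, trained roles, and control labels would come from an organisation's own risk assessment, and select_control is a hypothetical name.

```python
# Hypothetical policy inputs.
HIGH_IMPACT_CATEGORIES = {"financial_advice", "medical_advice"}
EXTERNAL_DESTINATIONS = {"customer_email", "public_website"}
TRAINED_ROLES = {"compliance_officer", "clinical_reviewer"}

# Control strength by number of risk factors present (capped at 2).
CONTROL_BY_SCORE = {
    0: "light_filter",
    1: "automated_policy_check",
    2: "human_review",
}


def select_control(role: str, category: str, destination: str) -> str:
    """Pick a control level from the risk factors of a single request."""
    score = 0
    if category in HIGH_IMPACT_CATEGORIES:
        score += 1
    if destination in EXTERNAL_DESTINATIONS:
        score += 1
    if role not in TRAINED_ROLES:
        score += 1
    return CONTROL_BY_SCORE[min(score, 2)]
```

Under these assumptions, a trained reviewer drafting an internal note gets the lightest control, while an untrained user sending financial advice to a customer is routed to human review.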

One useful rule is to reserve the strictest controls for high-impact use cases and allow lighter guardrails for low-risk productivity tasks. That avoids over-restricting users who only need drafting help, while still protecting scenarios where an unsafe answer could create legal, security, or operational harm. The practical test is simple: if a user can reach a risky answer by rephrasing the same request a few times, the control design is too brittle. For broader implementation context, the NIST Cybersecurity Framework 2.0 remains a solid baseline for governance and monitoring.
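The rephrasing test can be automated in a simple form: run a set of paraphrases of one known-risky request through the guardrail and list any that are not blocked. The sketch below uses a deliberately naive keyword guard to show what a brittle control looks like; naive_guard, find_bypasses, and the example phrasings are hypothetical.

```python
from typing import Callable, List


def naive_guard(prompt: str) -> bool:
    """Deliberately brittle guard: blocks only on one exact keyword."""
    return "password" in prompt.lower()


def find_bypasses(guard: Callable[[str], bool], risky_variants: List[str]) -> List[str]:
    """Return every paraphrase of a risky request that the guard lets through."""
    return [variant for variant in risky_variants if not guard(variant)]


# Paraphrases of one risky request, written by a reviewer.
variants = [
    "Share the admin password",
    "Share the admin passcode",
    "What does the admin type to log in?",
]
bypasses = find_bypasses(naive_guard, variants)
```

Here the first phrasing is blocked and the other two slip through, so by the rule above the control is too brittle. A non-empty result is a signal to move from keyword matching towards checks on intent or on the generated output.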

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST AI RMF		AI RMF fits layered governance for safe, trustworthy model outputs.
OWASP Agentic AI Top 10	LLM07	Prompt and output handling map to unsafe response and prompt abuse risks.
NIST CSF 2.0	PR.DS-5	Data protection and control monitoring support safer AI output handling.

Use AI RMF to set governance, measure output risk, and monitor controls continuously.
