How should organisations test AI models that handle sensitive data?

Why This Matters for Security Teams

Testing models that handle sensitive data is not just about whether the model answers correctly. The security question is whether the model can be induced to expose training data, memorised secrets, or adjacent records through prompt design, output shaping, or API misuse. That risk is especially visible in incidents like the DeepSeek breach, where exposed data and embedded secrets showed how quickly sensitive information can become discoverable at scale. NHI Management Group research also shows why this discipline matters: the Ultimate Guide to NHIs highlights how quickly identity-related weaknesses compound across systems.

Current guidance suggests treating model testing as a security validation exercise, not a quality check. That means evaluating leakage, recoverability, and unintended memorisation before release, then retesting after retraining, fine-tuning, tool changes, or changes to the API surface. The NIST Cybersecurity Framework 2.0 aligns to this mindset by framing risk management as an ongoing function, not a one-time approval. In practice, many security teams discover leakage only after users have already queried the model with real data, rather than through intentional pre-release testing.

How It Works in Practice

Effective testing starts by defining what sensitive data the model might expose: training corpus fragments, fine-tuning examples, embedded prompts, connected retrieval content, and any secrets that might be returned through tooling. The model should then be exercised with adversarial prompts against the exact interface production users will call, because behaviour often changes between notebooks, sandboxes, and the live API. Security teams typically combine red teaming, membership inference checks, output reconstruction attempts, and prompt variation to see whether the model reveals memorised content or can be steered into reproducing protected details.

For organisations handling regulated data, the test plan should be explicit and repeatable:

Probe for verbatim recall of names, identifiers, records, or code-like secrets.

Test whether context windows, retrieval layers, or logs leak protected material.

Measure whether paraphrasing, translation, or formatting changes increase recoverability.

Retest after fine-tuning, model updates, guardrail changes, or new integrations.

This approach is consistent with the security posture described in NHI research, where identity and access boundaries are treated as operational controls rather than static assumptions. It also reflects the NIST view that controls should be verified against the actual threat environment, not just documented in policy. Where applicable, teams can use the NIST Cybersecurity Framework 2.0 to anchor testing, evidence collection, and remediation tracking.

These controls tend to break down when models are connected to retrieval systems, plugins, or downstream automation because sensitive data can escape through the surrounding workflow even if the base model itself appears stable.

Common Variations and Edge Cases

Tighter testing often increases cost and release friction, requiring organisations to balance assurance against model delivery timelines. That tradeoff is real, especially when teams are operating under rapid retraining cycles or frequent prompt and tool updates. Current guidance is still evolving on the exact thresholds for acceptable memorisation, so there is no universal standard for this yet. Many organisations therefore define risk-based acceptance criteria by data class rather than trying to eliminate every possible disclosure path.

Edge cases matter. A model used only for internal drafting may tolerate different leakage thresholds than one processing health, financial, or customer identity data. Similarly, synthetic data can reduce exposure but does not remove the need to test for inference, because models may still reproduce patterns from the underlying source set. If the model is wrapped in an agentic workflow, the testing scope should extend to tool calls, retrieval permissions, and any place where the model can trigger data movement outside the intended boundary. The practical lesson from the DeepSeek breach is that exposure often comes from the surrounding system, not just the model weights.

Security teams should treat approval as conditional and revisit it whenever the data boundary changes, because sensitive-data risk rarely stays contained to one model version or one deployment path.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Non-Human Identity Top 10 address the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	PR.DS-1	Sensitive data testing maps to data protection and leakage risk.
NIST AI RMF		AI RMF supports ongoing testing for harmful model behaviour and disclosure.
OWASP Non-Human Identity Top 10	NHI-03	Sensitive model access often depends on exposed secrets and identities.

Verify model paths, logs, and retrieval layers against PR.DS-1 before production approval.

#1 Authority in NHI Education, Research and Advisory, empowering organizations to tackle the critical risks posed by Non-Human Identities (NHIs), including AI Agents.

How should organisations test AI models that handle sensitive data?

Why This Matters for Security Teams

How It Works in Practice

Common Variations and Edge Cases

Standards & Framework Alignment

Related resources from NHI Mgmt Group