What do security teams get wrong about AI-generated SQL for device investigations?

Why This Matters for Security Teams

AI-generated SQL is rarely just a drafting aid. In device investigations, the query itself becomes part of the control decision, so the real risk is not whether the SQL compiles but whether it preserves the investigative question, the scope of data access, and the intended blast radius. That is why security teams need to treat generated SQL as a governed artifact, not a convenience output. The NIST Cybersecurity Framework 2.0 still applies here because the workflow needs clear ownership, validation, and monitoring even when a model is helping draft the query.

The mistake is assuming the model is only translating intent into syntax. In practice, it can widen filters, change joins, expose more device telemetry than requested, or subtly alter time windows in ways that change the outcome of an investigation. That matters when the query is being used to triage endpoint compromise, lateral movement, or suspected policy violations. NHIMG’s DeepSeek breach research is a reminder that AI systems can surface or reproduce sensitive patterns at scale when governance is weak. Many security teams discover query drift only after an analyst has already run an AI-written query against production device data and trusted the result.

How It Works in Practice

The safer pattern is to separate drafting from authorization. The model can propose SQL, but a human or policy engine must validate that the query still answers the exact control question. That means checking table scope, selected fields, joins, time bounds, and whether the query can enumerate more devices or users than the investigation requires. Guidance is evolving, but current best practice is to evaluate generated SQL against the original intent, not just against syntax errors.
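
As a rough illustration of that validation step, the sketch below checks a generated query for table scope, selected fields, join count, and a time bound. Everything in it is an assumption made for the example: the `QueryPolicy` fields, the `device_events` table, and the `event_time` column are hypothetical, and the regex matching is a deliberate simplification that a real deployment would likely replace with a proper SQL parser.

```python
import re
from dataclasses import dataclass


@dataclass
class QueryPolicy:
    """Hypothetical description of what one investigation may touch.

    Table and column names are expected in lowercase.
    """
    allowed_tables: set
    allowed_columns: set
    time_column: str = "event_time"
    max_joins: int = 1


def review_generated_sql(sql: str, policy: QueryPolicy) -> list:
    """Return a list of findings; an empty list means no scope issue was detected."""
    findings = []
    # Collapse whitespace and drop one trailing semicolon before inspecting.
    normalized = " ".join(sql.strip().rstrip(";").split())
    lowered = normalized.lower()

    if not lowered.startswith("select "):
        findings.append("only SELECT statements are permitted")
    if ";" in lowered:
        findings.append("multiple statements are not permitted")

    # Table scope: every table after FROM or JOIN must be approved.
    tables = set(re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", lowered))
    for table in sorted(tables - policy.allowed_tables):
        findings.append(f"table out of scope: {table}")

    if len(re.findall(r"\bjoin\b", lowered)) > policy.max_joins:
        findings.append("more joins than the investigation allows")

    # Selected fields: no wildcards, no columns outside the approved set.
    select_list = re.match(r"select\s+(.*?)\s+from\s", lowered)
    if select_list:
        for column in (c.strip() for c in select_list.group(1).split(",")):
            if "*" in column:
                findings.append("wildcard selection is not permitted")
            elif column.split(".")[-1] not in policy.allowed_columns:
                findings.append(f"column out of scope: {column}")

    # Time bounds: the WHERE clause must reference the time column.
    time_bound = rf"\bwhere\b.*\b{re.escape(policy.time_column)}\b"
    if not re.search(time_bound, lowered):
        findings.append("missing time bound on " + policy.time_column)

    return findings
```

A check like this only catches structural drift. It cannot tell whether the time window or filter values match the analyst's question, so it complements human or policy-engine review rather than replacing it.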

Security teams usually get better results when they constrain the generation workflow:

Start from a fixed investigation template, then let the model fill only approved clauses.

Use least-privilege database accounts so the query cannot read beyond the investigation scope.

Require review of the final SQL before it runs against live device telemetry.

Log both the prompt and the rewritten query so reviewers can detect intent drift later.

Block destructive statements, broad exports, and unconstrained joins by policy.
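
One way to picture the template and logging items above is the sketch below. It assumes the model returns structured values rather than SQL text, which are then validated and bound to a fixed, pre-reviewed query. The template, the `device_events` schema, the approved event types, and the row cap are all hypothetical values chosen for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

# Fixed, pre-reviewed template: the model never writes the SQL text itself.
TEMPLATE = (
    "SELECT device_id, hostname, event_type, event_time "
    "FROM device_events "
    "WHERE event_type = ? AND event_time >= ? AND event_time < ? "
    "ORDER BY event_time LIMIT ?"
)
APPROVED_EVENT_TYPES = {"process_start", "network_connect", "logon"}  # assumed list
MAX_ROWS = 500  # assumed cap to prevent broad exports


def build_query(model_output: dict):
    """Validate model-proposed values and bind them to the fixed template."""
    event_type = model_output["event_type"]
    if event_type not in APPROVED_EVENT_TYPES:
        raise ValueError(f"event type not approved: {event_type}")
    start = datetime.fromisoformat(model_output["start"])
    end = datetime.fromisoformat(model_output["end"])
    if end <= start:
        raise ValueError("time window must be non-empty")
    limit = int(model_output.get("limit", MAX_ROWS))
    if limit < 1:
        raise ValueError("limit must be positive")
    limit = min(limit, MAX_ROWS)
    return TEMPLATE, (event_type, start.isoformat(), end.isoformat(), limit)


def audit_record(prompt: str, sql: str, params: tuple) -> str:
    """Serialize the prompt and final query so reviewers can detect drift later."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "sql": sql,
        "params": list(params),
        "sql_sha256": hashlib.sha256(sql.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)
```

Because the SQL text never changes, destructive statements and unconstrained joins cannot appear in the first place, and the hash in each audit record makes it easy to confirm later that the approved template was the one executed.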

This is where governance differs from ordinary code review. A query can be syntactically valid and still be operationally wrong if it expands the dataset, masks a relevant control condition, or changes the evidentiary meaning of the result. NIST’s framework is useful here because it emphasises response, measurement, and oversight rather than trusting the tool chain alone. The issue is reinforced by NHIMG’s LLMjacking: How Attackers Hijack AI Using Compromised NHIs research, which shows how quickly compromised identities and AI systems can be abused once trust boundaries are loose. These controls tend to break down when analysts are allowed to paste free-form prompts directly into production-connected query tools, because the model can then silently rewrite the investigative scope.
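
A database-level backstop can limit the damage when a rewritten query slips past review. The sketch below uses SQLite's authorizer hook from Python's standard `sqlite3` module purely as a stand-in, since the article does not name a telemetry store. The `ALLOWED_READS` mapping and the `device_events` schema are hypothetical, and a production system would more likely rely on its own database's roles, grants, or views.

```python
import sqlite3

# Hypothetical scope for one investigation: table -> readable columns.
ALLOWED_READS = {"device_events": {"device_id", "hostname", "event_time"}}


def scope_authorizer(action, arg1, arg2, db_name, source):
    """Allow plain SELECTs over approved columns; deny every other action."""
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    # For SQLITE_READ, arg1 is the table name and arg2 the column name.
    if action == sqlite3.SQLITE_READ and arg2 in ALLOWED_READS.get(arg1, set()):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


def open_scoped_connection() -> sqlite3.Connection:
    """Build a small in-memory fixture, then lock it down with the authorizer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE device_events "
        "(device_id TEXT, hostname TEXT, event_time TEXT, username TEXT)"
    )
    conn.execute(
        "INSERT INTO device_events VALUES "
        "('d-1', 'host-a', '2024-05-01T10:00:00', 'alice')"
    )
    conn.commit()
    # Installed last so the fixture setup above is not blocked.
    conn.set_authorizer(scope_authorizer)
    return conn


def run_scoped(conn: sqlite3.Connection, sql: str, params=()):
    """Execute a query, converting database rejections into PermissionError."""
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.Error as exc:
        raise PermissionError(f"query rejected: {exc}") from exc
```

With this in place, a query that reads `username`, selects every column, or issues a `DELETE` fails at the database boundary regardless of what the prompt or the model intended.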

Common Variations and Edge Cases

Tighter query governance often slows investigations, so teams must balance analyst agility against data exposure risk. That tradeoff becomes sharper in high-severity incidents, where responders want immediate answers and are tempted to accept broader queries “just this once.” Current guidance suggests that emergency access should still be bounded by policy, but there is no universal standard for how much query expansion is acceptable during an active investigation.

Edge cases matter. Device investigations often involve incomplete telemetry, inconsistent schema across endpoint tools, and ambiguous control questions such as “Which devices exhibited suspicious behaviour?” An AI model may fill in those gaps with assumptions that look reasonable but change the evidence set. Multi-step prompts are especially risky because a model can carry forward an earlier mistake into later revisions. The better pattern is to validate the final SQL against the original question and, where possible, compare it to a known-good reference query before execution.
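
The comparison against a known-good reference query can be sketched as a differential test on a small fixture database. The example below assumes SQLite via Python's standard `sqlite3` module, a hypothetical `device_events` table, and queries that return the device ID as their first column. It is an illustration of the idea, not a prescribed harness.

```python
import sqlite3


def device_set(conn, sql, params=()):
    """Run a query and return the set of device IDs in its first column."""
    return {row[0] for row in conn.execute(sql, params)}


def compare_to_reference(conn, generated_sql, reference_sql, params=()):
    """Report devices the generated query adds or drops relative to the reference."""
    generated = device_set(conn, generated_sql, params)
    reference = device_set(conn, reference_sql, params)
    return {
        "unexpected": sorted(generated - reference),  # scope was widened
        "missing": sorted(reference - generated),     # evidence was lost
        "matches": generated == reference,
    }


def build_fixture() -> sqlite3.Connection:
    """Create a tiny in-memory dataset with known answers (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE device_events (device_id TEXT, event_type TEXT, event_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO device_events VALUES (?, ?, ?)",
        [
            ("d-1", "logon", "2024-05-01T09:00:00"),
            ("d-2", "logon", "2024-05-03T09:00:00"),
            ("d-3", "process_start", "2024-05-01T11:00:00"),
        ],
    )
    return conn


REFERENCE_SQL = (
    "SELECT device_id FROM device_events "
    "WHERE event_type = 'logon' AND event_time >= :start AND event_time < :end"
)
# A drifted rewrite that silently dropped the event_type filter.
DRIFTED_SQL = (
    "SELECT device_id FROM device_events "
    "WHERE event_time >= :start AND event_time < :end"
)
```

Running `compare_to_reference(build_fixture(), DRIFTED_SQL, REFERENCE_SQL, {"start": "2024-05-01T00:00:00", "end": "2024-05-02T00:00:00"})` reports `d-3` as unexpected, which is the kind of widened evidence set that a syntax check would never flag.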

Teams should also be cautious when AI-written SQL is reused across incident types. A query tuned for malware hunting may be inappropriate for insider risk or compliance review because the data minimisation threshold is different. The practical lesson is simple: syntax checks are necessary, but they are not the control. The control is whether the generated query still expresses the exact investigative intent without expanding access beyond what the case requires.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	GV.RM-01	Generated SQL needs governed risk decisions, not just syntax checks.
NIST AI RMF		AI RMF addresses oversight, validity, and human accountability for model output.
OWASP Agentic AI Top 10		Prompted code generation can change scope and tool use, a core agentic risk.

Treat AI-written queries as governed artifacts and require review before production execution.
