What do security teams get wrong about using LLMs for exact calculations?

Why This Matters for Security Teams

The common mistake is treating an LLM like a calculator with a conversational interface. For exact math, that assumption is risky because the model is optimised to predict likely text, not to guarantee numerically correct output. In security and compliance workflows, a fluent but wrong calculation can cascade into incorrect risk scoring, bad thresholding, and false confidence in automated decision-making.

This matters most when LLM output is used to support controls, not just commentary. Teams often discover the problem only after an analyst has accepted a plausible result, a report has been published, or an automation has executed on the wrong numbers. The broader pattern is visible across agentic systems too: the AI Agents: The New Attack Surface report found that 80% of organisations reported AI agents performing actions beyond intended scope. That same overtrust shows up when models are asked to do work that should be deterministic, especially in security operations and governance. Current guidance from the NIST AI Risk Management Framework and the OWASP Agentic AI Top 10 points toward bounded use, explicit validation, and tool-based controls rather than blind trust in model output.

How It Works in Practice

For exact calculations, the safer pattern is to separate reasoning from computation. Let the model explain the task, interpret the result, or orchestrate the workflow, but route the arithmetic to a deterministic engine such as a calculator, query, rules system, or code execution step with known inputs and outputs. That approach reduces ambiguity and makes review easier.
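As a minimal sketch of that separation, the arithmetic can live in a small deterministic function while the model only supplies the inputs and explains the result. The function name `percent_change` and the sample figures are illustrative assumptions, not part of any framework cited here. Python's `decimal` module is used so that values are handled as exact decimals rather than binary floats.

```python
from decimal import Decimal, ROUND_HALF_UP


def percent_change(old: str, new: str, places: int = 2) -> Decimal:
    """Return the percentage change from old to new, rounded half-up.

    Inputs are passed as strings so they are parsed as exact decimals
    instead of binary floating-point approximations.
    """
    old_d, new_d = Decimal(old), Decimal(new)
    if old_d == 0:
        raise ValueError("percentage change is undefined when the baseline is zero")
    change = (new_d - old_d) / old_d * Decimal(100)
    # Decimal(10) ** -2 is Decimal('0.01'), i.e. two decimal places.
    return change.quantize(Decimal(10) ** -places, rounding=ROUND_HALF_UP)


# Hypothetical example: open findings rose from 1240 to 1519.
print(percent_change("1240", "1519"))  # 22.50
```

The model can still write the sentence around the figure, but the figure itself comes from code that a reviewer can rerun with the same inputs.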

Security teams should also define what counts as a “math task” versus a “language task.” For example, summarising a percentage change is different from calculating exposure, reallocating access limits, or computing a time-windowed control threshold. Calculations of the second kind should be validated by deterministic logic and, ideally, logged with their original inputs. This is consistent with the State of Non-Human Identity Security, which shows how confidence gaps persist when teams rely on systems they cannot fully verify. In adjacent operational contexts, the OWASP NHI Top 10 and the NIST AI 600-1 Generative AI Profile both support the same practical direction: constrain model autonomy, validate outputs, and preserve traceability.

Use the LLM for interpretation, drafting, and orchestration.

Use deterministic tools for arithmetic, thresholds, and reconciliation.

Validate every calculation that drives a security decision.

Log inputs, formulas, and outputs so reviewers can reproduce the result.
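One way to make the logging point concrete is a small record that stores the formula, inputs, and output together with a digest of the three. This is a sketch under stated assumptions: the field names and the helpers `record_calculation` and `record_is_intact` are hypothetical, and a plain SHA-256 digest only detects accidental changes, not deliberate tampering by someone who can recompute it.

```python
import hashlib
import json
from decimal import Decimal
from typing import Mapping


def _digest(formula: str, inputs: Mapping[str, str], result: str) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal records hash equally.
    canonical = json.dumps(
        {"formula": formula, "inputs": dict(inputs), "result": result},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_calculation(formula: str, inputs: Mapping[str, str], result: Decimal) -> dict:
    """Build a log record a reviewer can use to reproduce the calculation."""
    record = {"formula": formula, "inputs": dict(inputs), "result": str(result)}
    record["sha256"] = _digest(record["formula"], record["inputs"], record["result"])
    return record


def record_is_intact(record: dict) -> bool:
    """Check that the stored digest still matches the stored fields."""
    expected = _digest(record["formula"], record["inputs"], record["result"])
    return record.get("sha256") == expected


# Hypothetical example values.
entry = record_calculation(
    "(new - old) / old * 100", {"old": "1240", "new": "1519"}, Decimal("22.50")
)
print(json.dumps(entry, indent=2))
```

Keeping values as strings in the record avoids float rounding when the log is later parsed and the calculation is replayed.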

Teams that combine a model with an explicit calculation step get better control over error, drift, and auditability. These controls tend to break down when the LLM is embedded directly into workflow automation without a separate verification layer, because the wrong number can then become the basis for an automated action.
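A separate verification layer can be as simple as recomputing the value deterministically and refusing to proceed when the model's figure disagrees. The sketch below assumes the model's number arrives as text; `verify_claimed_value` and `gate_action` are hypothetical names, and the zero default tolerance is an assumption that a team would tune to its own rounding rules.

```python
from decimal import Decimal, InvalidOperation


def verify_claimed_value(
    claimed: str, recomputed: Decimal, tolerance: Decimal = Decimal("0")
) -> bool:
    """Return True only if the model-claimed number matches the recomputed one."""
    try:
        claimed_d = Decimal(claimed.strip())
    except InvalidOperation:
        # Free text such as "about 22%" is not a number; treat it as a mismatch.
        return False
    if not claimed_d.is_finite():
        # Reject NaN and infinities, which would otherwise parse successfully.
        return False
    return abs(claimed_d - recomputed) <= tolerance


def gate_action(claimed: str, recomputed: Decimal) -> Decimal:
    """Block the workflow on a mismatch; otherwise pass on the recomputed value."""
    if not verify_claimed_value(claimed, recomputed):
        raise ValueError(
            f"model-claimed value {claimed!r} does not match recomputed {recomputed}"
        )
    # Downstream automation always receives the deterministic value.
    return recomputed


print(gate_action("22.50", Decimal("22.5")))  # 22.5
```

Returning the recomputed value rather than the claimed one means the automation never acts on model-generated digits, even when the two agree.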

Common Variations and Edge Cases

Tighter validation often increases workflow friction, requiring organisations to balance speed against assurance. That tradeoff becomes more visible when teams want the model to “just do the math” inside chat, tickets, or agent pipelines without adding a second system.

There is no universal standard for this yet, but current guidance suggests a few practical exceptions. Minor arithmetic inside narrative text may be acceptable if the result is not security-critical and a human can spot-check it. By contrast, multi-step calculations, large-number operations, percentage rollups, and anything that changes access, prioritisation, or compliance status should be treated as deterministic work. The same caution applies to the AI LLM hijack breach, where trust in model-led behaviour can extend beyond the original intent once the system is operating in a live workflow.

For organisations building agentic tooling, the issue is not only arithmetic accuracy but also control separation. The safest design is to prevent the model from inventing numbers, then require an external tool, rules engine, or calculator service to produce the final value. That is especially important in environments with compliance reporting, privileged access decisions, or incident automation. These patterns align with the CSA MAESTRO agentic AI threat modeling framework, which treats unverified model behaviour as an operational risk rather than a convenience.
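For agentic tooling, one possible shape for that external calculator is a tool that accepts an arithmetic expression from the model and evaluates it without ever calling `eval`. The sketch below is an assumption about how such a tool might look, not a reference design: `calculate` is a hypothetical tool function that walks Python's `ast` and permits only numeric literals and the four basic operators, so function calls, names, and attribute access are rejected.

```python
import ast
import operator
from decimal import Decimal

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str) -> Decimal:
    """Evaluate +, -, *, / over numeric literals using exact decimal arithmetic."""
    source = expression.strip()
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not a valid expression: {expression!r}") from exc
    return _evaluate(tree.body, source)


def _evaluate(node: ast.AST, source: str) -> Decimal:
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return Decimal(node.value)
    if isinstance(node, ast.Constant) and type(node.value) is float:
        # Re-read the literal text so "0.1" becomes Decimal("0.1"), not a float.
        return Decimal(ast.get_source_segment(source, node))
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _evaluate(node.left, source)
        right = _evaluate(node.right, source)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand, source))
    # Names, calls, attributes, strings, and every other node type are refused.
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


print(calculate("0.1 + 0.2"))  # 0.3
```

The model proposes the expression and the tool returns the value, so the final number in a report or access decision is never free-form model text.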

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while the NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	N/A	Addresses unsafe model autonomy and output trust in agentic workflows.
CSA MAESTRO	N/A	Covers threat modeling for agentic workflows that mix reasoning and execution.
NIST AI RMF	N/A	Requires measurement, validation, and governance for AI outputs used operationally.

Separate model reasoning from deterministic calculation and log every tool-mediated result.
