By NHI Mgmt Group Editorial Team
Published 2026-04-20
Domain: Agentic AI & NHIs
Source: Lakera

TL;DR: Recent research reframes LLM hallucinations as an incentive problem: next-token training and common evaluation methods reward confident guessing over calibrated uncertainty. Newer mitigation work favours uncertainty-aware reward shaping, targeted finetuning, and retrieval with verification, according to Lakera's review of OpenAI, Anthropic, and recent conference papers. The security implication is simple: teams must design for visible uncertainty and controlled failure rather than assume models will self-correct.


At a glance

What this is: A 2026 synthesis of LLM hallucination research showing that training and evaluation still incentivise models to guess rather than to signal uncertainty.

Why it matters: IAM, NHI, and AI governance teams increasingly rely on model output in workflows where plausible but wrong answers can drive bad access, compliance, or operational decisions.

By the numbers:

  • Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.

👉 Read Lakera's review of LLM hallucination research and mitigation methods


Context

LLM hallucinations are plausible but unsupported outputs, and the governance problem is no longer just model quality. In production settings, the issue becomes whether teams can trust generated text, generated decisions, or generated evidence when the system is optimised to sound confident even when it is uncertain.

For identity and security programmes, that changes the control question. If a model is embedded in search, support, coding, or decision workflows, then hallucination becomes a risk to access governance, secrets handling, audit evidence, and human override paths, not just a research benchmark failure.


Key questions

Q: How should security teams use LLM output without creating blind trust?

A: Security teams should require evidence-backed generation, human review for high-impact actions, and explicit confidence signalling for uncertain answers. Model output should be treated as advisory until the system can prove its claims against source material. That approach reduces the chance that fluent but false content becomes an access, compliance, or incident-response decision.

Q: Why do LLM hallucinations matter to IAM and NHI programmes?

A: They matter because model output can influence approval, documentation, secrets handling, and operational triage. If a model fabricates a fact or misstates a source, the downstream effect can be an incorrect identity decision or a false control assertion. In identity programmes, trust must be earned through verification, not conversational confidence.

Q: What do teams get wrong about hallucination reduction?

A: Many teams focus on prompt tuning alone, but prompt quality does not fix a system whose incentives reward guessing. The better question is whether the workflow lets the model refuse, defer, or surface uncertainty before a person acts on the answer. That is a control design issue, not just a model quality issue.

Q: How do organisations decide when an LLM is safe enough for production use?

A: They should evaluate the exact workflow, not the model in isolation. Safe enough means the system can verify claims, handle uncertainty, and prevent unreviewed output from changing records, access, or customer outcomes. If those safeguards are missing, the model is not production ready for that use case.


Technical breakdown

Why next-token training rewards confident guessing

Large language models are trained to predict the next token, which means they are rewarded for producing likely text, not for signalling doubt. When benchmark scoring and human preference tuning penalise abstention, the model learns that a fluent guess often beats a cautious refusal. That creates a systemic calibration problem: the model may know less than it sounds like it knows. Recent work cited in the article reframes this as an incentive mismatch rather than a simple factual error problem.

Practical implication: evaluate model outputs for calibration, not just accuracy, before allowing them into customer-facing or control-plane workflows.
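To make the incentive argument concrete, the sketch below compares the expected score of guessing versus abstaining under two hypothetical grading rules: plain accuracy (1 for a correct answer, 0 for a wrong answer or an abstention) and a penalised rule where a wrong answer costs `wrong_penalty` points. The function names and penalty values are illustrative assumptions, not a scoring scheme taken from the cited papers.

```python
def expected_score(p_correct: float, abstain: bool, wrong_penalty: float) -> float:
    """Expected score for one question (hypothetical grading rule).

    p_correct: probability the model is right if it answers.
    abstain: True if the model says "I don't know" (always scores 0).
    wrong_penalty: points subtracted for a wrong answer (0.0 = plain accuracy).
    """
    if abstain:
        return 0.0
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_policy(p_correct: float, wrong_penalty: float) -> str:
    """Pick whichever behaviour maximises expected score; ties go to abstaining."""
    guess = expected_score(p_correct, abstain=False, wrong_penalty=wrong_penalty)
    hold = expected_score(p_correct, abstain=True, wrong_penalty=wrong_penalty)
    return "guess" if guess > hold else "abstain"


if __name__ == "__main__":
    # Columns: confidence, best policy under plain accuracy, best policy with penalty 1.0
    for p in (0.2, 0.5, 0.8):
        print(p, best_policy(p, wrong_penalty=0.0), best_policy(p, wrong_penalty=1.0))
    # 0.2 guess abstain
    # 0.5 guess abstain
    # 0.8 guess guess
```

Under plain accuracy, guessing is never worse than abstaining, so a model optimised against that score learns to guess. Once wrong answers carry a penalty `w`, abstaining wins whenever confidence is at or below `w / (1 + w)`, which is the kind of calibration-aware incentive the newer mitigation work points towards.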

How retrieval and verification reduce hallucination risk

Retrieval-augmented generation helps because it grounds answers in source material, but retrieval alone does not stop false claims. The article points to span-level verification, where each generated statement is checked against retrieved evidence, and unsupported spans are flagged or blocked. This matters because an LLM can still over-generalise, misread context, or fabricate citations even when documents are available. Verification shifts the system from best-effort answer generation to evidence-bound response production.

Practical implication: pair retrieval with claim-level verification before using model output as evidence in operational or compliance decisions.
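A minimal sketch of claim-level gating follows. It assumes each sentence of the answer is one claim and uses content-word overlap with the retrieved passages as a stand-in for a real entailment or verification model. The helper names, the naive sentence splitter, the stopword list, and the 0.6 threshold are hypothetical choices for illustration; lexical overlap is a much weaker check than the span-level verifiers the article describes.

```python
import re
from dataclasses import dataclass

# Tiny illustrative stopword list (assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "on", "with", "by", "it"}


@dataclass
class ClaimCheck:
    claim: str
    support: float   # best overlap score against any retrieved passage
    supported: bool


def content_words(text: str) -> set[str]:
    # Lowercased word tokens (hyphens kept) minus stopwords.
    return {w for w in re.findall(r"[a-z0-9\-]+", text.lower()) if w not in STOPWORDS}


def split_claims(answer: str) -> list[str]:
    # Naive split on sentence-ending punctuation; a real system would extract atomic claims.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def verify_answer(answer: str, passages: list[str], threshold: float = 0.6) -> list[ClaimCheck]:
    # Hypothetical verifier: lexical overlap stands in for an entailment model.
    passage_words = [content_words(p) for p in passages]
    results = []
    for claim in split_claims(answer):
        words = content_words(claim)
        if not words:
            continue
        # Fraction of the claim's content words found in the best-matching passage.
        support = max((len(words & pw) / len(words) for pw in passage_words), default=0.0)
        results.append(ClaimCheck(claim, support, support >= threshold))
    return results


def gate(answer: str, passages: list[str]) -> tuple[bool, list[str]]:
    """Return (release_allowed, unsupported_claims); any unsupported claim blocks release."""
    checks = verify_answer(answer, passages)
    unsupported = [c.claim for c in checks if not c.supported]
    return (not unsupported, unsupported)
```

For example, with the passage "The service account svc-backup was rotated on 2026-03-01 by the platform team.", the answer "The svc-backup account was rotated on 2026-03-01. It has domain admin rights." is blocked, and only the second sentence is reported as unsupported. The design choice is to fail closed: an answer is released only when every claim is grounded.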

Why multilingual and multimodal use cases fail differently

Hallucination rates are not uniform across tasks. The article highlights that multilingual benchmarks and multimodal reasoning tests continue to expose blind spots even in frontier models, especially when the model must combine text with images or operate in lower-resource languages. That means reliability is contextual, not universal. A model that performs acceptably in English chat can still confabulate when the input format, language, or evidence structure changes.

Practical implication: test the exact language and modality your programme will use, not just the model version listed on a vendor slide.
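One way to act on this is to report hallucination rates per language and modality slice rather than as a single headline number. The sketch assumes a hypothetical list of graded evaluation records, each tagged with a language, a modality, and whether a reviewer or verifier judged the output hallucinated. The record layout and the 5% acceptance threshold are illustrative assumptions, not figures from the article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    language: str       # e.g. "en", "sw" (illustrative codes)
    modality: str       # e.g. "text", "text+image"
    hallucinated: bool  # reviewer or verifier judgement for this output


def slice_rates(records: list[EvalRecord]) -> dict[tuple[str, str], float]:
    """Hallucination rate for each (language, modality) slice."""
    counts = defaultdict(lambda: [0, 0])  # key -> [hallucinated, total]
    for r in records:
        key = (r.language, r.modality)
        counts[key][0] += int(r.hallucinated)
        counts[key][1] += 1
    return {k: h / n for k, (h, n) in counts.items()}


def failing_slices(records: list[EvalRecord], max_rate: float = 0.05) -> list[tuple[str, str]]:
    """Slices whose rate exceeds the (hypothetical) acceptance threshold."""
    return sorted(k for k, rate in slice_rates(records).items() if rate > max_rate)
```

Because acceptance is decided per slice, a model that clears the English text slice can still be rejected for a lower-resource language or a text-and-image workflow, which matches the article's point that reliability is contextual.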



NHI Mgmt Group analysis

Hallucination is a governance problem before it is a model problem. Once LLM output is used in security operations, the failure is not simply incorrect text. It becomes mistaken approval, misplaced trust in fabricated evidence, or a bad downstream action taken because the model sounded certain. The relevant control question is whether the organisation can distinguish generated confidence from verified truth. Practitioners should treat model output as untrusted unless it is anchored and checked.

Calibration-aware systems are the right architectural direction. The article’s strongest point is that uncertainty must be represented explicitly rather than hidden behind fluent prose. That changes how teams design interfaces, escalation paths, and human review. A model that can refuse, defer, or surface low confidence is safer than one forced to answer every prompt. Practitioners should judge models by how they behave when they do not know.

Named concept: hallucination exposure window. This is the period between a plausible model output and the point at which someone verifies it, during which false information can propagate into tickets, code, reports, or access decisions. The shorter that window, the lower the operational risk. Teams should map where generated content is consumed without independent validation.

LLM reliability varies by task shape, not by model branding. The article shows that multilingual and multimodal settings remain fragile even as headline model quality improves. That means governance cannot rely on a single benchmark or a generic approval. Teams need context-specific acceptance criteria for each workflow, especially where outputs influence secrets handling, privilege decisions, or compliance evidence. Practitioners should validate the use case, not the logo.

From our research:

  • The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
  • The governance lesson is broader than secrets alone: the Ultimate Guide to NHIs and Lifecycle Processes for Managing NHIs show why lifecycle controls matter when trust, rotation, and review drift apart.

What this signals

Hallucination risk becomes a control-plane issue the moment AI output is allowed to influence identity or security work. Teams should assume that every generated answer needs a verifiable handoff, because the failure mode is not just a wrong sentence. It is a wrong decision that inherits the model’s confidence. That is why evidence linking and review boundaries matter more than model fluency.

Named concept: hallucination exposure window. The shorter the distance between generation and verification, the less opportunity a fabricated or distorted claim has to enter records, tickets, or control decisions. For teams working across IAM, NHI, and agentic workflows, that window should be treated as a measurable risk indicator, not a soft UX concern.
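Treating the window as a measurable indicator can start with two timestamps per generated artefact: when it was produced and when, if ever, it was independently verified. The sketch below assumes a hypothetical event log with exactly those fields; the field names and the 24-hour alert threshold are illustrative, not an NHIMG or Lakera metric definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class GeneratedArtifact:
    artifact_id: str
    generated_at: datetime
    verified_at: Optional[datetime]  # None = never independently verified


def exposure_window(a: GeneratedArtifact, now: datetime) -> timedelta:
    # Unverified artefacts keep accruing exposure until "now".
    end = a.verified_at if a.verified_at is not None else now
    return end - a.generated_at


def over_threshold(artifacts: list[GeneratedArtifact], now: datetime,
                   limit: timedelta = timedelta(hours=24)) -> list[str]:
    """IDs of artefacts whose exposure window exceeds the (hypothetical) limit."""
    return [a.artifact_id for a in artifacts if exposure_window(a, now) > limit]


if __name__ == "__main__":
    now = datetime(2026, 4, 20, 12, 0, tzinfo=timezone.utc)
    ticket_summary = GeneratedArtifact(
        "ticket-summary",
        datetime(2026, 4, 18, 12, 0, tzinfo=timezone.utc),
        None,  # consumed but never checked
    )
    print(exposure_window(ticket_summary, now))  # 2 days, 0:00:00
```

Counting unverified artefacts up to the present, rather than excluding them, keeps the metric honest: content that is never checked has the longest window of all.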

Because developers already struggle to manage secrets consistently, model-generated guidance that sounds authoritative can accelerate bad habits rather than correct them. That makes governance around AI-assisted workflows part of identity hygiene, not a separate AI project. For broader control design, the NIST Cybersecurity Framework 2.0 remains a useful anchor for mapping protect, detect, and respond responsibilities.


For practitioners

  • Gate model output with evidence checks: Require retrieval plus claim-level verification before generated text can feed tickets, reports, or access decisions. Treat unsupported spans as blocked content, not as acceptable ambiguity.
  • Measure calibration, not just accuracy: Track refusal quality, uncertainty signalling, and false confidence alongside task accuracy so you can see whether the model knows when to stop.
  • Limit high-trust use to bounded workflows: Allow unverified model output only in low-risk drafting or summarisation tasks. Keep any workflow that can change privileges, records, or evidence behind human review.
  • Test in the real language and modality: Run evaluations in the same languages, document types, and multimodal inputs your production workflow will actually use, because reliability changes materially with context.
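The "measure calibration" item can be tracked with a few standard quantities. The sketch below computes expected calibration error (ECE) with equal-width confidence bins, plus the abstention rate and a false-confidence rate, here defined as the share of answered items that were wrong despite a confidence at or above a chosen cut-off. It assumes each evaluation item carries a model-reported confidence in [0, 1]; the record layout, the 10 bins, and the 0.8 cut-off are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    confidence: float        # model-reported probability of being correct, in [0, 1]
    correct: bool            # graded outcome (ignored when the model abstained)
    abstained: bool = False  # model refused or deferred


def expected_calibration_error(preds: list[Prediction], n_bins: int = 10) -> float:
    """ECE over answered items: bin-size-weighted mean of |accuracy - confidence|."""
    answered = [p for p in preds if not p.abstained]
    if not answered:
        return 0.0
    bins: list[list[Prediction]] = [[] for _ in range(n_bins)]
    for p in answered:
        # Confidence 1.0 goes in the top bin instead of overflowing.
        idx = min(int(p.confidence * n_bins), n_bins - 1)
        bins[idx].append(p)
    ece = 0.0
    for b in bins:
        if b:
            acc = sum(p.correct for p in b) / len(b)
            conf = sum(p.confidence for p in b) / len(b)
            ece += (len(b) / len(answered)) * abs(acc - conf)
    return ece


def abstention_rate(preds: list[Prediction]) -> float:
    return sum(p.abstained for p in preds) / len(preds) if preds else 0.0


def false_confidence_rate(preds: list[Prediction], cutoff: float = 0.8) -> float:
    """Share of confident answers (>= cutoff, hypothetical value) that were wrong."""
    confident = [p for p in preds if not p.abstained and p.confidence >= cutoff]
    return sum(not p.correct for p in confident) / len(confident) if confident else 0.0
```

Reading the three numbers together matters: a low abstention rate combined with a high false-confidence rate is the "fluent guess" pattern the article warns about, even when headline accuracy looks acceptable.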

Key takeaways

  • LLM hallucinations are best understood as a calibration and incentive problem, not just a factual accuracy problem.
  • The risk matters operationally because plausible falsehoods can influence identity decisions, evidence handling, and security workflows.
  • Teams need verification, uncertainty signalling, and workflow boundaries before they allow model output to drive action.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while the NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Agentic AI Top 10 | AG-04 | Addresses unsafe model outputs and refusal behavior in agentic systems.
NIST AI RMF |  | Covers governance and measurement of AI uncertainty and risk.
NIST CSF 2.0 | PR.DS-6 | Relevant to protecting data and evidence used by model workflows.

Require refusal and evidence checks before agent output can trigger downstream action.
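That rule can be sketched as a release gate. The example assumes a hypothetical agent pipeline in which each proposed action carries a risk tier, the model's own abstention flag, and the result of an upstream evidence check. The tier names and routing outcomes are illustrative assumptions drawn from the practitioner guidance above; they are not defined by OWASP, NIST, or the article.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"    # drafting, summarisation
    HIGH = "high"  # can change privileges, records, or evidence


class Route(Enum):
    EXECUTE = "execute"
    DRAFT_ONLY = "draft_only"      # usable as a labelled draft, no downstream action
    HUMAN_REVIEW = "human_review"
    BLOCK = "block"


@dataclass
class ProposedAction:
    description: str
    risk: Risk
    model_abstained: bool    # model refused or flagged low confidence
    evidence_verified: bool  # every claim passed the evidence check


def route(action: ProposedAction) -> Route:
    # Respect a refusal: defer to a person instead of forcing an answer.
    if action.model_abstained:
        return Route.HUMAN_REVIEW
    if action.risk is Risk.HIGH:
        # Privilege, record, or evidence changes are never automatic;
        # unverified claims are blocked outright.
        return Route.HUMAN_REVIEW if action.evidence_verified else Route.BLOCK
    # Low-risk drafting: unverified output is allowed only as a draft.
    return Route.EXECUTE if action.evidence_verified else Route.DRAFT_ONLY
```

For example, an agent proposing to grant admin rights on the strength of an unverified claim is routed to `BLOCK`, while the same proposal with verified evidence still goes to `HUMAN_REVIEW`. Checking abstention first keeps refusal a first-class outcome rather than something the pipeline can override.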


Key terms

  • Hallucination calibration: The degree to which a model’s confidence matches the correctness of its output. In practice, calibrated systems know when to refuse, hedge, or defer rather than producing fluent guesses that sound reliable but are not grounded in evidence.
  • Retrieval-augmented generation: A pattern where a model retrieves external documents before generating an answer. It improves grounding, but it does not guarantee truth unless the system also checks whether each claim is supported by the retrieved evidence.
  • Faithfulness error: A mistake where the model distorts, omits, or misrepresents the source material or prompt. Unlike a pure factual error, a faithfulness error can produce answers that are internally coherent but externally misleading, which is especially risky in security and compliance workflows.
  • Uncertainty signalling: The practice of making doubt visible in system output, such as through refusal, confidence scores, or fallback messages. It helps operators distinguish between verified answers and probabilistic guesses, which is essential when model output can influence identity or security decisions.

What's in the full article

Lakera's full article covers the research detail this post intentionally leaves in summary form:

  • The OpenAI, Anthropic, SemEval, ACL, and EMNLP papers cited in the article, including how each study changes the way hallucination risk should be measured.
  • The article’s breakdown of mitigation methods such as calibration-aware rewards, targeted finetuning, retrieval with span-level verification, and internal hallucination detection.
  • The practical distinction between factuality errors and faithfulness errors, which matters when you are deciding where verification should sit in the workflow.
  • The article’s discussion of multilingual and multimodal benchmark failures, useful if your use case is not plain English chat.

👉 Lakera's full article includes the cited studies, benchmark examples, and mitigation comparisons in more operational detail.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or governance maturity, it is worth exploring.
NHIMG Editorial Note
Published by the NHIMG editorial team on 2026-04-20.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org