LLM-based identity risk scoring exposes the limits of IAM heuristics

By NHI Mgmt Group Editorial Team | Published 2026-03-17 | Domain: General NHI | Source: Okta

TL;DR: Okta describes an experiment that feeds raw identity event sequences into an LLM and scores anomalous logins and sessions by token-level surprise and peak perplexity, rather than relying only on engineered features. The shift matters because it tests whether IAM can evaluate behaviour as a sequence, not just as a set of counters and rules.

At a glance

What this is: An analysis of LLM-based adaptive identity risk scoring that treats log sequences as native text and uses perplexity to flag unusual authentication behaviour.

Why it matters: The same pattern could reshape how IAM and NHI practitioners score, review, and control autonomous identities, service accounts, and agent sessions.

👉 Read Okta's analysis of LLM-based identity risk scoring

Context

Identity risk scoring has usually depended on fixed features built from authentication telemetry, such as new IPs, unusual devices, and unfamiliar geographies. That works for bounded user sessions, but it becomes less reliable when the identity behaves as a sequence of events instead of a single login. For NHI governance, the question is whether current IAM models can still distinguish normal from risky when the actor is a service account or agent that changes context quickly.

The article uses Okta's experiment to test a different model: keep the event stream in its text form, let an LLM learn the narrative, and score the next event by how surprising it is. That approach is still experimental, but it reflects a broader shift in identity security toward behaviour-based evaluation of humans and NHIs alike. The starting point is typical of modern identity teams trying to enrich risk engines without replacing them.

The operational challenge is not whether risk scoring can be made more sophisticated. It is whether the resulting signal can be governed, calibrated, and explained well enough to support access decisions, step-up authentication, and automated response for both human and non-human identities.

Key questions

Q: How should security teams use LLM-based identity risk scoring in production?

A: Security teams should use LLM-based identity risk scoring as an input to policy, not as an autonomous decision-maker. The model is best suited for ranking unusual events, triggering step-up authentication, and prioritising review. Production use requires threshold calibration, monitoring for drift, and clear escalation rules for both human identities and NHIs.

Q: What is the difference between traditional IAM risk scoring and sequence-based scoring?

A: Traditional IAM risk scoring usually compares a login against engineered features such as new IPs or devices. Sequence-based scoring evaluates the full event history, so it can detect when the order, timing, or combination of events looks abnormal. That makes it better at context mismatch, but also harder to tune and explain.

Q: Why do NHIs make adaptive risk scoring harder?

A: NHIs generate more frequent, more automated, and less human-like event patterns than employee accounts. That means normal behaviour can change quickly across tools, environments, and workloads. Adaptive scoring has to distinguish legitimate automation from compromise, which requires identity ownership, clean telemetry, and policy thresholds designed for machines, not just people.

Q: When should organisations escalate a high-risk identity score?

A: Organisations should escalate when the score aligns with privileged access, unfamiliar context, or a deviation from the identity’s normal sequence of events. The best response is usually step-up verification, temporary restriction, or review by an identity owner. Scores alone should not create irreversible blocks without supporting policy and context.

Technical breakdown

How LLM-based identity risk scoring works

The model treats authentication telemetry as a sequence of structured text rather than as a fixed feature vector. Each event is cleaned to remove noisy tokens such as transaction IDs and then concatenated with recent identity history to form a profile-conditioned prompt. A language model is trained to predict the next event, and the loss is measured only on the target event. That lets the model learn what a plausible next login or session looks like for a specific identity. The core idea is sequence modelling, not rule matching.

Practical implication: Practitioners should evaluate whether their telemetry pipeline preserves enough event context to support sequence-based scoring.
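
The cleaning and prompt-assembly steps described above can be sketched roughly as follows. This is a minimal illustration, not Okta's implementation: the key=value log format, the field names (transaction_id, request_id), and the helper names clean_event and build_prompt are all assumptions made for the example.

```python
import re

# Hypothetical pattern for noisy, high-entropy fields; real log schemas will differ.
NOISY_FIELDS = re.compile(r"\b(transaction_id|request_id)=\S+\s*")

def clean_event(raw: str) -> str:
    """Drop noisy tokens such as transaction IDs and normalise whitespace."""
    return " ".join(NOISY_FIELDS.sub("", raw).split())

def build_prompt(history: list[str], target: str, max_history: int = 5) -> tuple[str, str]:
    """Return (context, target): recent cleaned history plus the event to score.

    The context conditions the model on the identity's profile; only the
    target portion would contribute to the loss or the risk score.
    """
    recent = [clean_event(e) for e in history[-max_history:]]
    context = "\n".join(recent) + "\n" if recent else ""
    return context, clean_event(target)

# Illustrative events for one identity (made-up values).
history = [
    "transaction_id=ab12 event=login country=US device=mac ua=Chrome",
    "transaction_id=cd34 event=login country=US device=mac ua=Chrome",
]
context, target = build_prompt(
    history, "transaction_id=ef56 event=login country=RO device=linux ua=curl"
)
print(context + target)
```

Keeping context and target separate makes it straightforward to mask the context tokens when computing loss, so the score reflects only how plausible the new event is given the history.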

Why peak perplexity is more useful than raw log averages

Perplexity measures how surprised the model is by a sequence. In identity risk scoring, standard averages can be distorted by boilerplate log text that appears in every event, so peak perplexity isolates the most informative tokens and ignores repetitive structure. That makes deviations in country, device, user agent, or issuer context more visible. The method is still probabilistic, which means it surfaces uncertainty rather than proving malicious intent. In practice, it is a ranking signal, not an accusation engine.

Practical implication: Teams should tune thresholds conservatively and validate which token changes actually correlate with risky access.
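
To make the contrast between averaged and peak perplexity concrete, here is a small sketch that works on per-token log-probabilities. The source does not give an exact formula for peak perplexity, so the top_k definition below is an assumption, and the token list and log-probability values are invented for illustration.

```python
import math

def mean_perplexity(logprobs: list[float]) -> float:
    """Standard perplexity: exp of the average negative log-likelihood."""
    return math.exp(-sum(logprobs) / len(logprobs))

def peak_perplexity(logprobs: list[float], top_k: int = 1) -> float:
    """Perplexity over only the top_k most surprising tokens.

    Assumed definition for illustration; the original experiment may differ.
    """
    worst = sorted(logprobs)[:top_k]  # most negative log-probs are the most surprising
    return math.exp(-sum(worst) / len(worst))

# Hypothetical natural-log probabilities for the tokens of one target event.
# Boilerplate tokens are near-certain; the country value is highly unexpected.
tokens = ["event", "=", "login", "country", "=", "RO", "device", "=", "mac"]
logprobs = [-0.01, -0.01, -0.02, -0.01, -0.01, -6.0, -0.01, -0.01, -0.05]

most_surprising = tokens[logprobs.index(min(logprobs))]
print(mean_perplexity(logprobs), peak_perplexity(logprobs), most_surprising)
```

With these made-up numbers the averaged perplexity stays near 2 because the boilerplate tokens dilute the signal, while the peak value is roughly 403 and points directly at the unexpected country token.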

How adaptive scoring changes the meaning of identity context

Traditional IAM risk engines usually compare an event against historical aggregates. An LLM-based model compares it against the story of the identity, including the order in which events occur. That matters when a service account, bot, or agent behaves differently across time, tools, and environments. The result is a more flexible anomaly detector, but also a harder governance problem because model outputs can drift as identity behaviour changes. For NHI programs, this makes feedback loops and review workflows essential.

Practical implication: Practitioners should pair adaptive risk scores with human review paths and periodic recalibration.
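
One simple way to approach periodic recalibration is a per-identity threshold derived from a sliding window of recent scores. The sketch below is an assumption about how a team might do this, not something the source describes; the class name RollingThreshold, the window size, and the percentile are all hypothetical choices.

```python
from collections import deque
from statistics import quantiles
from typing import Optional

class RollingThreshold:
    """Flag scores above a high percentile of a sliding window of recent scores."""

    def __init__(self, window: int = 500, percentile: int = 99, min_samples: int = 30):
        self.scores = deque(maxlen=window)  # oldest scores fall out automatically
        self.percentile = percentile        # expected to be between 1 and 99
        self.min_samples = min_samples

    def threshold(self) -> Optional[float]:
        """Return the current cut-off, or None if there is too little history."""
        if len(self.scores) < self.min_samples:
            return None
        cuts = quantiles(self.scores, n=100, method="inclusive")  # 99 cut points
        return cuts[self.percentile - 1]

    def observe(self, score: float) -> bool:
        """Compare the score with the current threshold, then add it to the window."""
        limit = self.threshold()
        flagged = limit is not None and score > limit
        self.scores.append(score)
        return flagged

# Illustrative run: a stable baseline followed by one sharp deviation.
detector = RollingThreshold(window=200, percentile=99, min_samples=30)
baseline = [1.0 + 0.01 * (i % 10) for i in range(100)]
flags = [detector.observe(s) for s in baseline]
spike = detector.observe(50.0)
print(spike)
```

Because flagged scores also enter the window, a slow, deliberate drift could shift the baseline over time, which is one reason to keep a human review path alongside any automatic recalibration.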

Threat narrative

Attacker objective: The objective is to obtain access that appears legitimate long enough to bypass static IAM checks and reach privileged systems or sessions.

Entry occurs when an attacker or anomalous actor triggers a sign-on or access sequence that looks ordinary at the surface but diverges in the details of context.
Escalation follows when the model detects a sharp mismatch in country, device, or user agent patterns that indicates the identity is operating outside its normal behavioural envelope.
Impact is the ability to surface risky access sooner, before the session reaches privileged applications or sensitive identity-bound resources.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

Sequence-aware risk scoring is the right direction for identity security, but it does not replace governance. An LLM can surface anomalous behaviour that fixed feature vectors miss, especially when the identity history matters as much as the current event. That improves detection sensitivity, yet it also introduces calibration risk if the model is treated as an authority rather than a signal. Practitioners should treat the output as one layer in a governed risk decision.

Identity risk models are moving closer to behavioural context, and that widens the NHI problem space. Service accounts, API keys, and agent sessions do not behave like employees with predictable schedules and locations. A model that learns narrative context can help, but it also shows why conventional IAM controls struggle when the actor has no human workflow. The practical conclusion is that NHI governance needs telemetry-aware policy, not just credential management.

Peak perplexity is a useful concept because it pinpoints which part of an event is most out of character for the identity. The score does not merely ask whether a login is rare. It asks which parts of the event most violate the established identity story, which is a better lens for both abuse detection and explainability. That is especially relevant when the same identity may span automation, integrations, and agentic workflows. Practitioners should design reviews around context mismatch, not just failed logins.

Adaptive scoring validates continuous verification as a security model, but it complicates access operations. Zero trust assumes trust must be re-evaluated, and this article shows one way that re-evaluation could become more dynamic. The downside is operational churn if every context shift is treated as high risk. Teams should therefore anchor these scores in policy tiers, escalation paths, and identity ownership before they put them into production.

From our research:
90% of IT leaders say properly managing NHIs is essential for a successful zero-trust implementation, according to the Ultimate Guide to NHIs.
91.6% of secrets remain valid five days after the targeted organisation is notified, which shows how slowly identity remediation can move in practice.
For a broader baseline, the 52 NHI Breaches Analysis report shows how compromised machine identities repeatedly expand incident scope and delay containment.

What this signals

Behavioural risk scoring will matter most where identity context changes faster than policy can keep up. For IAM teams, that means the real programme question is not whether a model can detect anomalies, but whether response workflows can act on them without breaking legitimate automation. The governance gap is becoming more visible as organisations move from static accounts to machine-driven access patterns.

LLM-based scoring creates a new control layer, but control layers only help when they are auditable. If a model drives step-up authentication or session restriction, teams need to know what signals triggered the decision and how to override it safely. That is especially true for NHIs, where the blast radius of a false positive can be operational as well as security-related.

Identity teams should expect policy engines and anomaly engines to converge. With 97% of NHIs carrying excessive privileges, according to the Ultimate Guide to NHIs, the next practical step is not just better detection. It is linking risk scoring to privilege boundaries so that context changes automatically translate into narrower access.

For practitioners

Implement sequence-aware risk evaluation: Test whether your IAM telemetry can preserve ordered identity events, not just aggregate counters. A sequence-aware pipeline should retain country, device, user agent, and session timing so the model can compare behaviour over time.
Calibrate peak-perplexity thresholds: Validate which token-level changes actually correlate with suspicious access, then set conservative thresholds for step-up MFA or session review. Use historical identity baselines to separate noise from meaningful context shifts.
Map scores to governance actions: Define what happens when a high-risk score appears, including manual review, temporary restriction, and incident escalation. Make sure the decision path is documented for both human identities and NHIs.
Separate detection from authorization: Use the model as a risk signal, not as a sole access control. Access decisions should still rely on policy, ownership, and least privilege, with the score influencing step-up controls rather than replacing them.
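
The recommendations above, mapping scores to governance actions while keeping detection separate from authorization, could be sketched as a small policy function. The action names, the RiskEvent fields, and the specific tiering are hypothetical and only illustrate the shape of such a mapping; a real programme would define its own tiers and owners.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    STEP_UP = "step_up_verification"
    RESTRICT = "temporary_restriction"
    OWNER_REVIEW = "owner_review"

@dataclass
class RiskEvent:
    identity_id: str
    is_nhi: bool       # service account, API key, or agent session
    privileged: bool   # whether the access touches privileged resources
    score: float       # e.g. a peak-perplexity score from the model
    threshold: float   # calibrated per identity or per policy tier

def decide(event: RiskEvent) -> Action:
    """Translate a risk signal into a reversible governance action.

    The score only selects a response tier; it never grants access and
    never produces a permanent block on its own.
    """
    if event.score <= event.threshold:
        return Action.ALLOW
    if event.is_nhi:
        # Machines cannot answer an MFA prompt, so route to restriction or owner review.
        return Action.RESTRICT if event.privileged else Action.OWNER_REVIEW
    return Action.RESTRICT if event.privileged else Action.STEP_UP

# Illustrative decisions with made-up identities and scores.
print(decide(RiskEvent("svc-build-01", True, True, 400.0, 50.0)))
print(decide(RiskEvent("alice", False, False, 400.0, 50.0)))
```

Every outcome here is reversible, which keeps the model in the role of a signal feeding policy rather than an authority making final access decisions.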

Key takeaways

LLM-based risk scoring shifts identity security from feature counting to sequence understanding, which is a meaningful architectural change.
Adaptive scoring can improve detection of unusual access patterns, but it also increases the need for calibration, explainability, and policy control.
For NHIs, the operational priority is to connect behavioural risk signals to governance actions that reduce privilege and speed response.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10		Adaptive identity scoring matters when agents and automation act with tool access.
NIST CSF 2.0	PR.AA-05	Risk-based access decisions support least privilege and continuous verification.
NIST Zero Trust (SP 800-207)		Continuous reassessment of trust aligns with zero trust principles for identity sessions.

Tie agent access decisions to runtime risk signals and require step-up controls for context shifts.

Key terms

Peak Perplexity: Peak perplexity is a risk-scoring method that focuses on the most surprising tokens in an identity event rather than averaging the whole log line. It helps surface unusual context changes, such as a new country or device, while reducing noise from repetitive system text.
Sequence-based Risk Scoring: Sequence-based risk scoring evaluates identity behaviour as an ordered series of events instead of a single login snapshot. This approach can detect context shifts that aggregated counters miss, especially when the order and timing of events carry the real security signal.
Profile-conditioned Prompting: Profile-conditioned prompting feeds an identity's recent event history into a language model before scoring the next event. The model uses that historical context to judge whether the current event fits the expected behavioural pattern, which makes the score adaptive to the specific identity.
Identity Context Mismatch: Identity context mismatch occurs when an event conflicts with the established pattern for that identity, such as a sudden change in geography, device, or access path. In NHI and IAM programs, it is a useful indicator of compromise, automation error, or policy drift.

Deepen your knowledge

LLM-based identity risk scoring and NHI behavioural governance are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are evaluating how adaptive scoring fits into your identity programme, it is worth exploring.

This post draws on content published by Okta: analysis of LLM-based identity risk scoring and adaptive anomaly detection. Read the original.
