Malicious LLM

A large language model used or adapted for offensive purposes, often by removing the safety guardrails that restrict harmful output. For practitioners, the concern is not the label alone but the operational effect: more persuasive social engineering, faster iteration, and higher campaign throughput.

Expanded Definition

A malicious LLM is a model that has been tuned, repurposed, or operationally weaponised to produce harmful output at scale. In NHI security, the term matters because such a model is not merely a source of “untrusted” content; it can become an active enabler for credential theft, phishing, malware assistance, or policy evasion when coupled with compromised identities and weak guardrails. Definitions vary across vendors, especially when distinguishing a malicious LLM from a benign model that is simply misused, but the operational question is whether the model is optimised for harmful tasks and deployed to achieve them.

For governance teams, this sits adjacent to agentic AI risk and LLMjacking, where attackers abuse AI access paths or secret exposure to run models at scale. Standards and risk guidance such as the NIST AI 600-1 Generative AI Profile and the OWASP Top 10 for Agentic Applications 2026 focus on misuse pathways, prompt injection, and control failures rather than the model label alone. The most common misapplication is treating any harmful output as proof of a malicious LLM. Harmful output can equally come from a legitimate model that has been exposed to adversarial prompts or unsafe tool access.

Examples and Use Cases

Implementing detection and containment for malicious LLM activity often introduces a tradeoff between model openness and operational control, requiring organisations to weigh developer agility against abuse resistance.

  • An attacker modifies or fine-tunes a local model to generate high-volume phishing copy that matches internal tone and job roles.
  • A criminal workflow uses a model to iterate social engineering lures faster than human review can flag them, increasing campaign throughput.
  • Stolen API keys are used to access hosted AI services, then the output is redirected into fraud, extortion, or reconnaissance workflows, a pattern discussed in LLMjacking: How Attackers Hijack AI Using Compromised NHIs.
  • Security teams compare the malicious use case against the normal agentic control surface described in the OWASP NHI Top 10 and the OWASP Agentic AI Top 10.
  • Red teams use a hostile model to test whether employees, copilots, or autonomous agents will comply with harmful instructions despite policy controls.

In practice, the term is also used when a model has safety layers deliberately removed to bypass refusal behaviour, even if the underlying base model is otherwise mainstream. That distinction matters because the security response changes from content moderation to identity protection, access restriction, and downstream tool isolation.

Why It Matters in NHI Security

Malicious LLMs are dangerous in NHI environments because they amplify every weak identity boundary around models, tools, and automation. When secrets, service accounts, or agent credentials are exposed, attackers can move from prompt abuse to persistent capability abuse. NHIMG research shows how quickly this escalates: when AWS credentials are exposed publicly, attackers attempt access within an average of 17 minutes, and as quickly as 9 minutes in some cases, a pattern highlighted in LLMjacking: How Attackers Hijack AI Using Compromised NHIs. That speed turns model misuse into an identity incident.
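
The timing figures above can be turned into a simple triage check. The following is a minimal sketch, not an NHIMG method: it assumes you know when a credential was exposed and when it was revoked, and the function name `classify_exposure` and its labels are hypothetical. Only the 17-minute and 9-minute figures come from the text.

```python
from datetime import datetime, timedelta

# Figures quoted in the text: time from public exposure to first access attempt.
AVERAGE_ATTEMPT = timedelta(minutes=17)
FASTEST_ATTEMPT = timedelta(minutes=9)


def classify_exposure(exposed_at: datetime, revoked_at: datetime) -> str:
    """Label an exposure by how revocation time compares with the quoted windows.

    The labels are illustrative assumptions, not a standard taxonomy.
    """
    elapsed = revoked_at - exposed_at
    if elapsed <= FASTEST_ATTEMPT:
        return "revoked-before-fastest-attempt"
    if elapsed <= AVERAGE_ATTEMPT:
        return "revoked-before-average-attempt"
    # Revocation came after the average attacker would already have tried the key.
    return "assume-compromised"


if __name__ == "__main__":
    exposed = datetime(2025, 1, 1, 12, 0)
    print(classify_exposure(exposed, exposed + timedelta(minutes=30)))
```

Under these assumptions, any credential that stays valid for longer than the average window is treated as compromised and handled as an identity incident rather than as a content issue.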

NHIMG analysis of the report AI Agents: The New Attack Surface shows that 80% of organisations say their AI agents have already performed actions beyond their intended scope, including revealing access credentials. That makes malicious LLMs relevant not just to content risk but to privilege escalation, data exfiltration, and fraud. The right response includes identity scoping, output controls, logging, and revocation paths aligned with the NIST AI Risk Management Framework and the CSA MAESTRO agentic AI threat modeling framework. Organisations typically encounter the full impact only after a model-assisted breach, at which point addressing malicious LLM behaviour is no longer optional.
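
The four controls named above (identity scoping, output controls, logging, and revocation) can be sketched as a small gate in front of an agent's tool calls. This is an illustrative sketch only: `AgentIdentity`, `ToolGate`, and `redact_output` are hypothetical names, the tool names are made up, and the single regular expression for AWS access key IDs is an assumed example of an output control, not a complete secret detector.

```python
import logging
import re
from dataclasses import dataclass, field

logger = logging.getLogger("nhi_gate")

# Assumed example pattern: AWS access key IDs ("AKIA" plus 16 uppercase alphanumerics).
AWS_KEY_ID = re.compile(r"AKIA[0-9A-Z]{16}")


@dataclass
class AgentIdentity:
    name: str
    allowed_tools: frozenset  # identity scoping: the only tools this agent may call
    revoked: bool = False


@dataclass
class ToolGate:
    identities: dict = field(default_factory=dict)

    def register(self, identity: AgentIdentity) -> None:
        self.identities[identity.name] = identity

    def revoke(self, name: str) -> None:
        # Revocation path: one call disables every tool for the identity.
        if name in self.identities:
            self.identities[name].revoked = True
            logger.warning("revoked identity %s", name)

    def authorize(self, name: str, tool: str) -> bool:
        identity = self.identities.get(name)
        allowed = (
            identity is not None
            and not identity.revoked
            and tool in identity.allowed_tools
        )
        # Logging: every decision is recorded, including denials.
        logger.info("identity=%s tool=%s allowed=%s", name, tool, allowed)
        return allowed


def redact_output(text: str) -> str:
    """Output control: mask anything that looks like an AWS access key ID."""
    return AWS_KEY_ID.sub("[REDACTED]", text)


if __name__ == "__main__":
    gate = ToolGate()
    gate.register(AgentIdentity("mailer", frozenset({"send_email"})))
    print(gate.authorize("mailer", "send_email"))
    print(gate.authorize("mailer", "read_secrets"))
```

The design choice here is deny by default: an unknown identity, a revoked identity, and an out-of-scope tool all fail the same check, so a model that has been manipulated into requesting a new tool gains nothing without a matching identity grant.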

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 and the OWASP Agentic AI Top 10 address the attack and risk surface, while the NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Non-Human Identity Top 10 | NHI-02 | Malicious LLM abuse often starts with stolen or mismanaged credentials.
OWASP Agentic AI Top 10 | A2 | Covers prompt abuse and unsafe tool execution in agentic systems.
NIST AI RMF | | Frames generative AI risk handling across governance, mapping, and measurement.

Constrain model outputs, tool calls, and escalation paths before harmful automation can execute.