Multilingual AI security gaps expose English-first LLM defenses

By NHI Mgmt Group Editorial TeamPublished 2025-08-28Domain: Agentic AI & NHIsSource: Lakera

TL;DR: English-first guardrails leave LLMs exposed to prompt attacks, data extraction, and translation-based bypasses across many languages, according to Lakera’s analysis of real-world cases and research. Security teams need multilingual defenses, not just broader model coverage, because policy enforcement breaks when language handling is inconsistent.

At a glance

What this is: This is an analysis of how multilingual prompt attacks bypass English-centric LLM safeguards and why consistent security enforcement across languages is now required.

Why it matters: It matters because IAM and security teams governing AI agents, NHI-backed AI services, and user-facing GenAI need controls that hold across languages, not just in English.

By the numbers:

Attackers have successfully bypassed Gandalf’s guardrails in over 85 languages using techniques like code-switching, translation-based exploits, and multilingual data extraction.
Only 44% of developers are reported to follow security best practices for secrets management, exposing a significant developer behaviour gap.

👉 Read Lakera's analysis of multilingual AI security risks and bypass techniques

Context

English-only AI security assumes that a prompt judged harmful in one language will remain harmful in another, but that assumption does not hold in multilingual LLM environments. If moderation, jailbreak detection, and policy enforcement are trained or tuned primarily in English, attackers can shift languages, code-switch, or translate their requests to find blind spots in the control layer.

For IAM and security leaders, the issue is not language support as a user-experience feature. It is governance consistency across AI systems that may expose sensitive data, execute prompts, or mediate business workflows in dozens of languages. The article argues that multilingual protection has to be treated as a security baseline, not a localization add-on.

Lakera’s examples show this is not a theoretical edge case. The starting position is typical for teams that have inherited English-centric guardrails from model development and applied them globally without re-testing enforcement by language.

Key questions

Q: How should security teams test LLM guardrails across multiple languages?

A: Security teams should test the same harmful intent in every major language and in code-switched variants, then compare block, warn, and allow outcomes. The goal is to prove that policy enforcement is intent-based rather than English-based. If equivalent prompts produce different outcomes, the control is inconsistent and should not be trusted in production.

Q: Why do multilingual prompts increase the risk of AI data leakage?

A: Multilingual prompts increase leakage risk because safety controls often recognise restricted requests more reliably in English than in translated or transliterated forms. Attackers can use that gap to elicit sensitive data, especially where the model has access to internal documents or connected tools. Consistent enforcement across languages is essential when AI handles confidential information.

Q: What do teams get wrong about multilingual AI security?

A: Teams often assume that a model that understands many languages is automatically secure in those languages. Understanding is not the same as enforcement. A system can answer fluently across languages while still applying weaker moderation, weaker jailbreak detection, or weaker output filtering in non-English inputs.

Q: How can organisations tell whether multilingual safety controls are actually working?

A: They should measure whether the same policy decision holds across translations, transliterations, and mixed-language prompts. A working control produces consistent outcomes for equivalent intent, not just consistent performance in one language. Any gap between languages is a governance weakness, especially when the AI can access sensitive data or trigger downstream actions.

Technical breakdown

Why English-first moderation fails in multilingual LLMs

Most LLM safety stacks are built around training data, red-team prompts, and policy examples that are heavily English-weighted. That means the model may recognise dangerous intent in English but miss equivalent intent when it is translated, transliterated, or split across languages. Code-switching adds another failure mode because the harmful instruction is distributed across multiple language contexts, which weakens pattern matching and safety classification. The result is uneven moderation rather than a clean bypass, so the model behaves as if policy applies inconsistently across its own input space.

Practical implication: retest moderation and jailbreak detection by language family, not just by prompt category.

How translation-based exploits create policy drift

Translation-based attacks work because the security layer often evaluates a prompt after language transformation, not at the point where user intent first appears. Attackers can rephrase blocked instructions into lower-monitored languages, use transliteration to evade keyword-based controls, or ask the model to translate sensitive content out of a restricted context. This is not a model weakness alone. It is a control-design weakness where policy decisions depend on language assumptions that attackers can manipulate. If the policy engine cannot normalise intent across languages, the same request will be allowed in one form and denied in another.

Practical implication: add language-aware normalization and cross-language policy tests before relying on output filters.

Why multilingual data leakage is a governance problem, not just a model problem

Multilingual data leakage matters because sensitive information often travels through prompts, responses, and downstream automations. If one language path exposes personal data or confidential content more readily than another, the organisation has a governance inconsistency, not merely a moderation bug. That inconsistency becomes more serious in production GenAI because user trust, compliance obligations, and auditability all depend on uniform enforcement. Once attackers learn which language path weakens the guardrails, they can use that path repeatedly to extract information the policy was meant to suppress.

Practical implication: require pre-production and post-release multilingual abuse tests wherever AI handles sensitive data.

Threat narrative

Attacker objective: The attacker wants to bypass safety controls so the model reveals sensitive information or performs an action the English-first policy would have blocked.

Entry occurs when an attacker submits a harmful prompt in a non-English language, or mixes languages in a way that the safety stack has not been tuned to handle.
Credential access is replaced here by policy bypass, where the model accepts a translated or transliterated instruction that would have been blocked in English.
Impact follows when the model reveals restricted information, produces disallowed output, or applies inconsistent moderation that attackers can repeatedly exploit.

Moltbook AI agent keys breach — Moltbook breach exposed 1.5M AI agent keys.
AI LLM hijack breach — attackers used stolen AWS access keys to hijack Anthropic LLM models on Bedrock.

Read our 52 NHI Breaches Analysis report for a comprehensive view of breaches impacting Non-Human Identities including AI Agents.

NHI Mgmt Group analysis

English-first AI security is a governance assumption, not a technical constant. The article shows that many teams have treated multilingual support as a deployment detail instead of a control requirement. That assumption fails once the same harmful intent can be expressed in multiple languages, because enforcement no longer maps cleanly to policy intent. The practical conclusion is that multilingual enforcement has to be designed as part of the control model, not added after launch.

Multilingual prompt attacks create an identity control gap between intent and enforcement. In GenAI systems, the actor is not just a user submitting text, but a language-aware adversary selecting the weakest policy path. That means the risk sits in the enforcement layer itself, not simply in prompt content. For practitioners, the lesson is that AI safety tooling must be evaluated for language coverage the same way IAM teams evaluate authentication coverage across channels.

Multilingual data extraction is a secrets governance issue as much as an AI safety issue. Once a model can reveal different answers depending on the language path, organisations lose consistent control over sensitive information. That directly affects AI systems connected to NHI-backed services, data stores, and enterprise workflows. The implication is that AI governance, data protection, and identity control now overlap in the same runtime decision path.

Language coverage should become a measurable security control, not a vague product claim. Security leaders need to ask where coverage is actually tested, which languages are excluded, and whether policy outcomes are consistent across translation paths. This is the kind of control gap that can persist silently in production until an attacker finds it. Practitioners should treat multilingual evaluation as a standing acceptance criterion for any AI system that can touch sensitive data.

Multilingual security breaks the English-centric design bias already present in many AI programmes. The strongest defensive position is not to assume broader model capability automatically yields broader protection. LLMs can understand more languages than the surrounding security stack can govern. Practitioners should therefore align model rollout, moderation testing, and sensitive-data controls before expanding AI to new regions or user groups.

From our research:
Attackers have successfully bypassed Gandalf’s guardrails in over 85 languages using techniques like code-switching, translation-based exploits, and multilingual data extraction, according to LLMjacking: How Attackers Hijack AI Using Compromised NHIs.
43% of security professionals are concerned about AI systems learning and reproducing sensitive information patterns from codebases.
Multilingual AI security is the next control boundary to test, and Top 10 NHI Issues provides the broader identity risk context practitioners should use when expanding AI governance.

What this signals

Multilingual coverage now belongs in the control baseline for any AI system that can touch sensitive data. Teams that only test English-language prompts are effectively leaving a portion of their enforcement surface unexamined. The programme implication is straightforward: language coverage needs to be treated like authentication coverage, with clear acceptance criteria and repeatable tests before rollout.

The statistics are a warning sign, not a comfort blanket. As our LLMjacking research shows, attackers have already found more than one way to turn language variation into policy bypass. That means AI governance teams should assume the attack surface will continue to expand as models are deployed into more regions, more interfaces, and more business processes.

Cross-language policy drift: when the same intent is allowed in one language and blocked in another, the organisation does not have a moderation problem alone, it has a governance problem. Security leaders should connect AI safety testing to identity, data access, and change management so multilingual failures do not enter production unnoticed.

For practitioners

Test safety controls by language family Run jailbreak, prompt-injection, and data-extraction tests in the languages your users and attackers are most likely to use, then compare allow and block outcomes for the same intent.
Evaluate translation and code-switching paths Check whether the model can be induced to ignore or weaken policy when a request is split across languages, transliterated, or translated before moderation.
Add multilingual abuse cases to release gates Require pre-production approval to include non-English prompt sets, low-resource language tests, and output review for sensitive-data leakage before deployment.
Align AI safety with data governance Map where the model can access secrets, personal data, or confidential content, then verify that language-specific enforcement is consistent at each access point.

Key takeaways

English-centric AI safety creates uneven enforcement, and attackers can use language variation to bypass controls.
The evidence is already public and repeatable, which makes multilingual testing a release requirement rather than a nice-to-have.
Security, data, and identity teams should verify that the same policy outcome holds across translation paths before AI systems reach production.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 address the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	AGENT-03	Multilingual prompt injection and policy bypass map to agentic application input handling.
NIST AI RMF		Consistent governance and measurement are central to multilingual AI safety.
NIST CSF 2.0	PR.DS-5	Sensitive data exposure through AI output is a data security concern.

Test agent inputs across languages and transliterations before allowing sensitive tool access.

Key terms

Multilingual prompt injection: A prompt attack that uses more than one language, translation, or transliteration to weaken an AI system’s safety checks. In practice, the attacker is not changing the goal, only the linguistic path, which can cause the model to apply policy inconsistently across otherwise equivalent requests.
Code-switching: The act of mixing two or more languages within a single prompt or conversation. In AI security, code-switching matters because it can split harmful intent across language boundaries, making it harder for moderation systems to detect the request as dangerous even when the user’s meaning is clear.
Multilingual data extraction: The use of language variation to coax an AI system into revealing information it should keep private. This can happen when one language path is more weakly defended than another, allowing an attacker to retrieve confidential or sensitive content by changing how the question is phrased.
Policy drift: A condition where the same security rule produces different outcomes depending on language, channel, or input form. For AI systems, policy drift is a governance failure because it means enforcement is not stable enough to be trusted across the full user population.

What's in the full article

Lakera's full article covers the operational detail this post intentionally leaves for the source:

Examples of multilingual prompt attacks on Gandalf and why they bypass English-centric guardrails.
The article's comparison of code-switching, translation exploits, and multilingual extraction techniques.
The checklist for designing multilingual AI security controls across inputs, outputs, and monitoring.
The source's explanation of why low-resource languages create weaker safety coverage.

👉 The full Lakera article includes attack examples, test cases, and the multilingual security checklist.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity security are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building or maturing an identity security programme, it is worth exploring.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-08-28.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org