How should security teams test LLM guardrails across multiple languages?

Why This Matters for Security Teams

Multilingual testing is not a translation exercise; it is a control validation exercise. If a guardrail only blocks harmful prompts in English, attackers can express the same intent in Spanish, Arabic, Hindi, Chinese, or code-switched variants and still reach the same risky outcome. That matters because guardrails often sit in front of tool use, retrieval, or workflow execution, where a single miss can expose data or trigger an unsafe action. Both the NIST AI Risk Management Framework and OWASP’s agentic guidance emphasize that controls must be evaluated against the actual risk condition, not the surface form of the prompt.

This is especially relevant for teams validating language model policy filters, prompt classifiers, and safety layers that claim global coverage. NHIMG research on OWASP NHI Top 10 and agentic application risk shows that inconsistent enforcement is often the real failure mode, not total bypass. In practice, many security teams discover weak language coverage only after a user proves the same malicious intent can pass through a non-English path.

How It Works in Practice

Testing should start with a canonical harmful intent set, then express each intent in the major languages your users and attackers are likely to use. The key is to keep the intent constant while varying the language, grammar, script, and phrasing. Teams should include direct translations, paraphrases, slang, abbreviations, typos, and code-switching. The resulting decisions (block, warn, or allow) should then be compared for consistency across every language and variant of the same intent.
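One way to keep the intent constant while the surface form varies is to store each intent as a single record with its variants attached. The sketch below is illustrative only: the class names, the `expected_decision` field, and the language tags are assumptions made for this example, not part of any framework cited here, and the prompt text is a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum


class VariantKind(Enum):
    """Surface-form variations named in the text above."""
    DIRECT_TRANSLATION = "direct_translation"
    PARAPHRASE = "paraphrase"
    SLANG = "slang"
    ABBREVIATION = "abbreviation"
    TYPO = "typo"
    CODE_SWITCHED = "code_switched"


@dataclass(frozen=True)
class PromptVariant:
    intent_id: str
    language: str      # hypothetical convention: short language tag such as "es"
    kind: VariantKind
    text: str          # the test prompt itself (placeholder in this sketch)


@dataclass
class IntentCase:
    """One canonical intent plus every way it is expressed in the test set."""
    intent_id: str
    risk_category: str
    expected_decision: str  # assumed ground truth, e.g. "block"
    variants: list[PromptVariant] = field(default_factory=list)

    def add(self, language: str, kind: VariantKind, text: str) -> None:
        self.variants.append(PromptVariant(self.intent_id, language, kind, text))

    def coverage_gaps(self, languages, kinds=tuple(VariantKind)):
        """Return (language, kind) combinations that have no variant yet."""
        have = {(v.language, v.kind) for v in self.variants}
        return [(lang, k) for lang in languages for k in kinds
                if (lang, k) not in have]
```

Calling `coverage_gaps(["es", "hi"])` on a case lists the language and variant combinations still missing, which makes uneven coverage visible before any guardrail is run.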

A practical workflow usually includes:

Define one risk category at a time, such as self-harm, fraud, malware, data exfiltration, or policy evasion.

Create a seed prompt in English, then translate it and also rephrase it natively in each target language.

Test mixed-language prompts, especially where an English policy term is embedded in a non-English request.

Record whether the guardrail response is blocked, warned, downgraded, or allowed, and compare the outcome matrix.

Repeat the same test against retrieval, tool invocation, and agent handoff paths, not just chat responses.
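The workflow above can be sketched as a small harness that records one decision per intent, path, and language, then flags rows where languages disagree. This is a minimal sketch under stated assumptions: `guardrail` is a hypothetical callable you would wire to your own system, and the decision labels and path names are taken from the steps above rather than from any product API.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple

# Decision labels from the workflow above.
DECISIONS = ("blocked", "warned", "downgraded", "allowed")


class Probe(NamedTuple):
    intent_id: str
    language: str
    path: str   # assumed labels: "chat", "retrieval", "tool", "handoff"
    text: str


def build_outcome_matrix(probes: Iterable[Probe],
                         guardrail: Callable[[str, str], str]) -> dict:
    """Map (intent_id, path) -> {language: decision}.

    `guardrail(path, text)` is a hypothetical adapter around the system
    under test. If several probes share an intent, path, and language,
    the last decision wins in this simplified version.
    """
    matrix = defaultdict(dict)
    for p in probes:
        decision = guardrail(p.path, p.text)
        if decision not in DECISIONS:
            raise ValueError(f"unexpected decision: {decision!r}")
        matrix[(p.intent_id, p.path)][p.language] = decision
    return dict(matrix)


def inconsistent_rows(matrix: dict) -> dict:
    """Keep only rows where the same intent got different outcomes."""
    return {key: row for key, row in matrix.items()
            if len(set(row.values())) > 1}
```

Keying rows by path as well as intent keeps a chat-only pass from hiding a miss on the tool invocation or agent handoff path.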

This matters because many models and filters are trained more heavily on English, while policy intent may be interpreted differently across scripts and regional phrasing. The operational standard should be intent-based enforcement: the system should reject the same harmful request regardless of language. That aligns with the broader control intent described in the OWASP Agentic AI Top 10 and with implementation patterns discussed in the CSA MAESTRO agentic AI threat modeling framework.

NHIMG’s AI Agents: The New Attack Surface report found that 80% of organisations report their AI agents have already performed actions beyond their intended scope, which is a reminder that language filters are only one layer. These controls tend to break down when translated prompts are routed through separate moderation models, because each layer may score intent differently and create inconsistent policy decisions.

Common Variations and Edge Cases

Tighter multilingual guardrails often increase test effort and false positives, requiring organisations to balance broader coverage against usability and review overhead. That tradeoff is real, especially when teams support many locales or low-resource languages. Current guidance suggests prioritising the languages that map to user base, threat exposure, and business impact first, then expanding coverage through risk-based sampling rather than assuming parity from day one.
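A simple way to make that prioritisation explicit is to score each language on the three factors and rank by a weighted sum. The sketch below is an assumption-heavy illustration: the 0 to 1 scoring scale, the equal default weights, and the example figures are invented for this example and would need to come from your own user and threat data.

```python
def rank_languages(scores, weights=(1.0, 1.0, 1.0)):
    """Rank languages by weighted user base, threat exposure, and impact.

    `scores` maps a language tag to a (user_base, threat_exposure,
    business_impact) tuple, each on an assumed 0..1 scale.
    """
    w_user, w_threat, w_impact = weights

    def total(item):
        _, (user, threat, impact) = item
        return w_user * user + w_threat * threat + w_impact * impact

    ranked = sorted(scores.items(), key=total, reverse=True)
    return [lang for lang, _ in ranked]


# Illustrative numbers only.
example = {
    "en": (0.9, 0.5, 0.8),
    "es": (0.4, 0.6, 0.5),
    "hi": (0.2, 0.9, 0.9),
}
priority = rank_languages(example)
```

Raising the threat-exposure weight moves languages with a small user base but high attacker interest up the list, which is the point of risk-based sampling.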

There is no universal standard for multilingual guardrail evaluation yet, so teams have to set their own baselines, and the most common pitfall is overconfidence in translation quality. Literal translation can miss idioms, coded language, honorifics, or culturally specific euphemisms that carry harmful intent. Teams should also test code-switched prompts, because attackers often blend English policy keywords with another language to evade brittle classifiers. When a model is wrapped by several vendors or moderation services, the policy boundary can become fragmented and harder to interpret.

For high-risk deployments, compare results against a single ground-truth policy and keep a failure log by language. The most useful metric is not pass rate alone, but divergence rate: how often the same intent receives a different outcome across languages. NHIMG breach research such as the LLMjacking threat analysis reinforces why this matters, because inconsistent policy enforcement becomes a practical attack path once credentials, tools, or downstream actions are in scope.
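Divergence rate can be computed directly from recorded outcomes. The sketch below uses one possible definition, the share of intents tested in at least two languages whose decisions are not all identical. Both that definition and the data shapes are assumptions for illustration. It also keeps the per-language failure log against a single ground-truth policy.

```python
from collections import Counter


def divergence_rate(outcomes):
    """Share of intents whose decision differs across languages.

    `outcomes` maps intent_id -> {language: decision}. Intents tested in
    fewer than two languages are skipped because they cannot diverge.
    """
    comparable = [row for row in outcomes.values() if len(row) >= 2]
    if not comparable:
        return 0.0
    divergent = sum(1 for row in comparable if len(set(row.values())) > 1)
    return divergent / len(comparable)


def failures_by_language(outcomes, expected):
    """Count, per language, decisions that differ from the ground truth.

    `expected` maps intent_id -> the decision the policy requires.
    """
    failures = Counter()
    for intent_id, row in outcomes.items():
        for language, decision in row.items():
            if decision != expected[intent_id]:
                failures[language] += 1
    return failures
```

Tracking both numbers matters because a guardrail can have a high overall pass rate while a single language accounts for nearly all of the divergence.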

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

OWASP Agentic AI Top 10 and CSA MAESTRO address the attack and risk surface, while NIST AI RMF sets the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
OWASP Agentic AI Top 10	A1	Covers prompt injection and safety bypass across languages.
CSA MAESTRO	TR-3	Addresses threat modeling for agentic controls and policy gaps.
NIST AI RMF		Supports risk-based evaluation of AI controls and failure modes.

Measure multilingual guardrails against defined risk outcomes, not just English prompts.
