LLMs can scaffold complex UI, but accessibility still breaks

By NHI Mgmt Group Editorial Team | Published 2025-07-29 | Domain: Best Practices | Source: WorkOS

TL;DR: According to WorkOS, a test of Claude, Gemini, and o3 on a tree-based combobox found that LLMs can scaffold compound component APIs quickly but still struggle with nested behaviour, keyboard support, screen-reader semantics, and state coordination in complex UI. That makes context, tests, and manual review the real guardrails, not prompt length alone.

At a glance

What this is: A WorkOS analysis of how three LLMs handled a tree-based combobox. The key finding is that they were useful for scaffolding but unreliable for complex interactive behaviour.

Why it matters: IAM and platform teams are increasingly using LLMs to accelerate developer workflows, and the same failure modes that break UI composition also break identity-sensitive flows, access logic, and governance assumptions.

Context

A tree-based combobox is a searchable dropdown with nested, collapsible choices, and it becomes difficult when keyboard behaviour, screen-reader semantics, and filtering logic all have to stay consistent at the same time. In enterprise software, that kind of component often sits close to identity and permissions workflows, so a small accessibility or state bug can become a governance problem as well as a usability problem.

The WorkOS test shows a familiar pattern: the models could sketch the structure quickly, but behaviour degraded once the component had to coordinate nested state, focus, and accessibility. For IAM teams, that is a reminder that AI-assisted development is strongest at pattern completion, not at preserving control semantics across edge cases.

Key questions

Q: How should teams use LLMs safely for complex UI components?

A: Use LLMs for scaffolding, boilerplate, and pattern completion, but require tests and human review for behaviour, accessibility, and state coordination. The safest workflow is to treat the model’s output as a draft that must pass keyboard, focus, and screen-reader checks before merge. For identity-adjacent UI, the acceptance bar should be stricter, not looser.

Q: Why do AI-generated components fail more often when nested interaction gets complicated?

A: Nested interaction creates competing event handlers, focus states, and render rules that are easy for a model to approximate but hard to keep consistent. The model may reproduce the visual structure while missing the interaction contract. That is why the risk rises sharply when expansion, selection, and text input all share the same component tree.

Q: What do security and platform teams get wrong about AI-assisted development?

A: They often assume that a plausible first draft means the hard part is solved. In practice, the hard part is preserving semantics under edge cases, especially when accessibility, keyboard control, or policy-sensitive UI state is involved. AI can shorten drafting time, but it cannot replace validation of the behaviour contract.

Q: When should teams prefer manual implementation over more prompting?

A: Prefer manual implementation when repeated prompting starts changing one bug into several new ones. That is a sign the model has lost the behavioural shape of the component and is optimising for surface resemblance. At that point, rewriting the critical paths is usually faster than continuing to iterate on unstable output.

Technical breakdown

Compound components can look right before they behave right

The first strength the article exposes is structural imitation. LLMs can mirror a compound-component API, especially when the design system already has strong patterns for the model to copy. But a tree-based combobox is not just markup. It has nested interactive state, selection rules, and disclosure behaviour that must all remain aligned. When a model generates the right component shape but not the right interaction model, the code looks plausible while still failing at runtime. That gap matters in enterprise UI because interface correctness is often inseparable from policy correctness.

Practical implication: Review AI-generated UI for interaction semantics, not just component structure.

Keyboard navigation and focus management are the real breakpoints

The article shows that nested disclosure inside a combobox introduces competing event lifecycles. Selection, expansion, and text input all want to own the same key events, and focus can disappear if the component tree is not coordinated carefully. That is why an apparently small UI request becomes a systems problem. In accessible components, focus order, keyboard intent, and announcement behaviour are part of the contract, not optional polish. LLMs tend to preserve appearance before they preserve those contracts.

Practical implication: Test keyboard paths and focus transitions before accepting AI-generated interactive components.
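The competing claims on key events can be made explicit by routing every key through one pure function that names which concern owns it. The sketch below is illustrative only: `TreeNode`, `ComboState`, `KeyOwner`, and `routeKey` are hypothetical names, not taken from the WorkOS article, and the routing policy (arrow keys fall back to the text input when the tree has nothing to do) is one assumed design, not a prescribed one.

```typescript
// Hypothetical shapes for a tree combobox; all names here are assumptions.
interface TreeNode {
  id: string;
  label: string;
  children?: TreeNode[];
}

interface ComboState {
  activeId: string | null;        // row that currently has virtual focus
  expanded: ReadonlySet<string>;  // ids of expanded parent rows
  selectedId: string | null;
}

type KeyOwner = "input" | "navigation" | "disclosure" | "selection";

// Rows in visual order: children appear only under expanded parents.
function visibleNodes(nodes: TreeNode[], expanded: ReadonlySet<string>): TreeNode[] {
  const out: TreeNode[] = [];
  for (const node of nodes) {
    out.push(node);
    if (node.children && expanded.has(node.id)) {
      out.push(...visibleNodes(node.children, expanded));
    }
  }
  return out;
}

// Decides which concern owns a key press and returns the next state.
function routeKey(
  tree: TreeNode[],
  state: ComboState,
  key: string,
): { owner: KeyOwner; state: ComboState } {
  const visible = visibleNodes(tree, state.expanded);
  const index = visible.findIndex((n) => n.id === state.activeId);
  const active = index >= 0 ? visible[index] : undefined;
  const hasChildren = (active?.children?.length ?? 0) > 0;

  switch (key) {
    case "ArrowDown":
    case "ArrowUp": {
      const step = key === "ArrowDown" ? 1 : -1;
      // With no active row, ArrowDown starts at the top and ArrowUp at the bottom.
      const next = index < 0
        ? (step === 1 ? 0 : visible.length - 1)
        : Math.min(visible.length - 1, Math.max(0, index + step));
      const target = visible[next];
      const activeId = target ? target.id : state.activeId;
      return { owner: "navigation", state: { ...state, activeId } };
    }
    case "ArrowRight":
      if (active && hasChildren && !state.expanded.has(active.id)) {
        const expanded = new Set(state.expanded);
        expanded.add(active.id);
        return { owner: "disclosure", state: { ...state, expanded } };
      }
      return { owner: "input", state }; // nothing to expand: the caret moves instead
    case "ArrowLeft":
      if (active && state.expanded.has(active.id)) {
        const expanded = new Set(state.expanded);
        expanded.delete(active.id);
        return { owner: "disclosure", state: { ...state, expanded } };
      }
      return { owner: "input", state };
    case "Enter":
      if (active) {
        return { owner: "selection", state: { ...state, selectedId: active.id } };
      }
      return { owner: "input", state };
    default:
      return { owner: "input", state }; // printable keys belong to the text field
  }
}
```

Because the function is pure, each keyboard path becomes a one-line assertion, which gives reviewers a concrete way to check an AI-generated handler against the intended contract before looking at rendering.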

Filtering and accessibility fail when logic is inferred instead of specified

The article also highlights a common model failure mode: guessing at hidden rules. The LLM searched the wrong values, hid parents incorrectly, and used ARIA patterns that did not fit the actual interaction. That is what happens when intent is described in prose but not codified in tests or component contracts. For complex UI, behaviour needs executable definitions. Otherwise the model improvises across accessibility, filtering, and nested rendering in ways that are hard to detect until manual testing.

Practical implication: Encode expected behaviour in tests and file-level comments before asking the model to generate implementation.
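As an illustration of what an executable definition might look like for the filtering rules, here is a minimal sketch. The names `TreeNode` and `filterTree` are hypothetical, and the rules it encodes (match on the visible label, keep a parent whenever a descendant matches, keep the whole subtree when the parent itself matches) are assumptions chosen for the example, not the behaviour specified in the WorkOS article.

```typescript
// Hypothetical node shape; the label is the text users see and search.
interface TreeNode {
  id: string;
  label: string;
  children?: TreeNode[];
}

// Returns a pruned copy of the tree for a search query.
// Assumed rules:
//   1. Matching is case-insensitive and runs against `label`, never `id`.
//   2. A parent stays visible if any descendant matches.
//   3. A parent that matches by itself keeps all of its children.
function filterTree(nodes: TreeNode[], query: string): TreeNode[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return nodes; // an empty query shows everything

  const result: TreeNode[] = [];
  for (const node of nodes) {
    const selfMatches = node.label.toLowerCase().includes(needle);
    if (selfMatches) {
      result.push(node);
      continue;
    }
    const keptChildren = filterTree(node.children ?? [], needle);
    if (keptChildren.length > 0) {
      // Keep the ancestor so matching descendants remain reachable.
      result.push({ ...node, children: keptChildren });
    }
  }
  return result;
}
```

Writing the rules down as numbered comments next to the function gives both a reviewer and a model something specific to check, where a prose prompt leaves each rule open to guessing.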

NHI Mgmt Group analysis

AI-assisted UI work is most reliable at scaffolding, not at preserving control semantics. The article shows that LLMs can produce a credible starting point for a complex component, but the deeper the interaction model goes, the more they lose fidelity. That is a design-system problem as much as a coding problem, because enterprise UI depends on stable conventions for composition, state, and accessibility. The practitioner conclusion is that AI can accelerate the first draft, but it does not replace architecture discipline.

Complex interactive components expose a broader governance gap: prompt output is not the same as validated behaviour. The models reproduced surface patterns, yet failed where nested state, keyboard handling, and screen-reader support had to coexist. That failure mode is especially relevant to identity-adjacent interfaces, where user actions can carry security consequences. The practitioner implication is that teams should treat AI-generated code as untrusted until tests prove the control path works.

Context richness is becoming a control surface in AI-assisted development. The article’s strongest results came when the model had more monorepo and design-system context, which suggests that local conventions shape output quality more than generic model capability does. That is a workflow lesson for engineering organisations, but also an identity lesson: systems behave better when their boundaries and expectations are explicit. The practitioner conclusion is to invest in codified context, not just more prompting.

Behavioural drift in AI-generated UI is a risk that deserves its own name. We call it interaction fidelity debt: the gap between a generated component that looks right and one that preserves keyboard, focus, and accessibility behaviour. This debt accumulates when teams optimise for fast scaffolding without executable acceptance criteria. The practitioner conclusion is to measure AI output against behaviour contracts, not visual similarity.

For identity-rich software, the bar is not whether the model can build a component, but whether it can preserve the semantics that keep users, permissions, and policy aligned. The article demonstrates that the last mile is where autonomy breaks down into ambiguity, regression, and manual repair. The practitioner conclusion is to keep humans accountable for the semantic layer even when AI handles the first pass.

From our research:
80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems (39%), inappropriately sharing sensitive data (31%), and revealing access credentials (23%), according to the AI Agents: The New Attack Surface report.
Only 52% of companies can track and audit the data their AI agents access, leaving 48% with a complete blind spot for compliance and breach investigation.
As AI-assisted development expands, teams should pair behavioural validation with governance discipline, as outlined in the OWASP Agentic AI Top 10.

What this signals

Interaction fidelity debt: the next wave of AI-assisted engineering risk is not just wrong output, but output that is visually plausible and behaviourally incomplete. As more teams let models scaffold internal tools and design-system components, the programme risk shifts from typing speed to semantic drift, where keyboard flow, focus management, and accessibility are the first controls to fail.

The governance lesson is that context must become part of the development control plane. Teams that rely on AI for nested or stateful UI should expect more value from codified tests, local conventions, and design-system rules than from additional prompt iteration, and they should align those guardrails with patterns documented in the Analysis of Claude Code Security and the NIST AI Risk Management Framework.

For practitioners

Start with tests for complex components: Define keyboard flows, focus transitions, filtering behaviour, and accessibility expectations before generating code. Use those tests as the acceptance gate for any AI-produced component, especially when the component affects permissions or other identity-adjacent workflows.
Keep in-file context close to the code: Add comments, usage examples, and local design-system conventions directly in the files the model touches. Do not rely on chat history alone, because context loss is a major cause of regressions in nested interactive components.
Audit AI output for interaction semantics: Review the generated code for event ownership, focus handling, and the correct ARIA pattern before merging. A component that renders correctly but fails on keyboard or screen readers should be treated as incomplete, not acceptable.
Use AI as a scaffolding layer, not the final implementer: Let the model create the first draft of structure, naming, and boilerplate, then plan for manual completion of state coordination and edge cases. The moment behaviour becomes nested or stateful, the human review burden rises sharply.
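
One way to make the ARIA part of the audit step above more mechanical is to derive attributes from component state in a single place, so a reviewer checks one function instead of scattered JSX. The sketch below reflects our reading of the WAI-ARIA combobox and tree roles and should be verified against the specification and real screen readers. The function names `inputAria` and `treeItemAria`, and the `TreeNode` shape, are hypothetical.

```typescript
// Hypothetical node shape used for illustration.
interface TreeNode {
  id: string;
  label: string;
  children?: TreeNode[];
}

type AriaProps = Record<string, string | number | boolean>;

// Attributes for the text input that owns the popup.
// Assumption: the popup is a tree, so aria-haspopup is "tree".
function inputAria(popupId: string, open: boolean, activeId: string | null): AriaProps {
  const props: AriaProps = {
    role: "combobox",
    "aria-haspopup": "tree",
    "aria-expanded": open,
    "aria-controls": popupId,
  };
  // Only point at an active row while the popup is actually open.
  if (open && activeId !== null) {
    props["aria-activedescendant"] = activeId;
  }
  return props;
}

// Attributes for one row in the tree popup. `level` is 1-based.
function treeItemAria(
  node: TreeNode,
  level: number,
  expanded: ReadonlySet<string>,
  selectedId: string | null,
): AriaProps {
  const props: AriaProps = {
    role: "treeitem",
    id: node.id,
    "aria-level": level,
    "aria-selected": node.id === selectedId,
  };
  // aria-expanded belongs on parent rows only; leaf rows omit it.
  if (node.children && node.children.length > 0) {
    props["aria-expanded"] = expanded.has(node.id);
  }
  return props;
}
```

The returned objects can be spread onto elements in whatever UI framework is in use, and asserted on directly in unit tests. Attribute checks like these complement manual screen-reader testing; they do not replace it.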

Key takeaways

LLMs can accelerate complex UI scaffolding, but they remain unreliable when nested state, accessibility, and keyboard behaviour must all stay aligned.
The strongest signal in the article is behavioural drift, where a generated component looks correct while still breaking the interaction contract.
Teams should use tests, local context, and human review as the gate for AI-generated UI, especially in identity-adjacent workflows.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Agentic AI Top 10 addresses the attack and risk surface, while NIST AI RMF and NIST CSF 2.0 set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST AI RMF		AI-generated code quality depends on governance and validation controls.
OWASP Agentic AI Top 10		Agentic AI guidance helps assess model-driven software creation risks.
NIST CSF 2.0	PR.PS-06	Secure development processes need repeatable validation of generated code.

Use AI RMF governance practices to require testing and accountability before merging AI-generated UI.

Key terms

Compound Component: A compound component is a UI pattern built from multiple coordinated parts that share state and behaviour. In practice, it lets teams compose interfaces from reusable subcomponents, but it also demands tight control over focus, events, and accessibility so the pieces behave as one coherent control.
Interaction Fidelity Debt: Interaction fidelity debt is the gap between a component that looks correct and one that preserves the intended behaviour under real user interaction. It grows when generated code passes visual inspection but fails on keyboard flow, focus management, screen-reader support, or nested state transitions.
Design System Context: Design system context is the local information a model needs to produce code that fits an organisation’s existing component patterns. It includes naming conventions, composition rules, implementation examples, and behavioural expectations, and it materially improves the chance that AI-generated code matches real engineering standards.

Deepen your knowledge

Complex UI scaffolding, accessibility checks, and AI-assisted code review are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are building AI-enabled development workflows with identity-sensitive UI, it is worth exploring.

This post draws on content published by WorkOS: Vibecoding a complex combobox component. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-07-29.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org