By NHI Mgmt Group Editorial Team
Published 2026-04-15
Domain: Governance & Risk
Source: WorkOS

TL;DR: Enterprises are sitting on thousands of hours of video that remain effectively unsearchable, while AI-native platforms now promise multimodal extraction across speech, slides, visuals, and context, according to WorkOS's interview with Here founder Mazy Dar. The governance issue is not whether video can be indexed, but whether identity, access, and audit controls can keep pace once video becomes a first-class data source.


At a glance

What this is: This interview argues that AI-native video understanding is moving video from a passive archive into a queryable enterprise data source, with multimodal processing as the core technical shift.

Why it matters: IAM, access control, and audit design must extend to rich media content that can now be searched, quoted, and operationalised inside workflows.

👉 Read WorkOS's interview on AI-native video understanding and enterprise search


Context

Video governance is no longer just a storage problem. When recorded meetings, lectures, demos, and customer calls become searchable and reusable, the organisation is no longer protecting a static archive. It is governing a data source that can reveal sensitive context, decisions, and behaviour across the enterprise.

The identity question is whether access controls, data residency rules, and audit logging are designed for this kind of content lifecycle. Searchable video changes how information flows through collaboration, support, sales, and training systems, so the control plane has to move with it. That is a broader governance shift than transcription alone.


Key questions

Q: How should enterprises govern AI systems that make video content searchable?

A: Start by treating searchable video as governed knowledge, not passive storage. Define who may index it, who may query it, and who may export derived clips or quotes. Then pair classification, entitlement review, and audit logging so the original recording and its extracted outputs are controlled together across collaboration and workflow systems.

Q: Why do multimodal video platforms create new IAM and audit risks?

A: They create new artefacts that do not exist in a plain file share model: searchable moments, extracted quotes, and visual context attached to a queryable index. Those derivatives can be redistributed faster than the original content, so IAM teams need to govern access to the outputs and the audit trail around their reuse.

Q: What do security teams get wrong about transcription versus video understanding?

A: They often assume transcription is enough because it turns speech into text. In practice, transcription drops slides, body language, on-screen context, and other signals that matter for enterprise knowledge. A proper governance model has to account for richer extraction, which means broader exposure and stricter controls than transcript search alone.

Q: How can organisations decide whether video search is ready for production use?

A: Use three checks: latency that fits daily workflows, accuracy high enough for trust, and controls that cover access, data residency, and audit logging. If any one of those fails, the platform may be useful for experimentation but not yet ready to carry regulated or sensitive enterprise content.
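The three checks can be expressed as a simple go/no-go gate. The sketch below is illustrative only: the `ReadinessReport` fields and both thresholds are assumptions chosen for the example, not figures from the interview, and each organisation would set its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadinessReport:
    """Hypothetical measurements gathered during a pilot."""
    p95_latency_seconds: float  # how long a typical search takes
    retrieval_accuracy: float   # fraction of test queries answered correctly
    has_access_control: bool
    has_data_residency: bool
    has_audit_logging: bool


# Assumed thresholds for illustration; tune these to your own workflows.
MAX_P95_LATENCY_SECONDS = 2.0
MIN_RETRIEVAL_ACCURACY = 0.90


def failed_checks(report: ReadinessReport) -> list[str]:
    """Return the names of the checks that the platform does not pass."""
    failures = []
    if report.p95_latency_seconds > MAX_P95_LATENCY_SECONDS:
        failures.append("latency")
    if report.retrieval_accuracy < MIN_RETRIEVAL_ACCURACY:
        failures.append("accuracy")
    controls = (
        report.has_access_control,
        report.has_data_residency,
        report.has_audit_logging,
    )
    if not all(controls):
        failures.append("controls")
    return failures


def is_production_ready(report: ReadinessReport) -> bool:
    """All three checks must pass; one failure keeps the platform in pilot."""
    return not failed_checks(report)


pilot = ReadinessReport(1.2, 0.94, True, True, False)
print(failed_checks(pilot))        # ['controls']
print(is_production_ready(pilot))  # False
```

In this example the pilot is fast and accurate but lacks audit logging, so it stays in experimentation.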


Technical breakdown

Multimodal video understanding versus transcription

Transcription converts speech into text, but it strips out the visual layer that often carries the real meaning. Multimodal video understanding processes audio, on-screen text, slides, and visual context together, which lets a system retrieve specific moments rather than only transcript fragments. That makes the output more useful for search, quotation, and workflow integration. It also raises the bar for trust because the platform must preserve enough context to support accurate retrieval, not just keyword matching.

Practical implication: treat video indexing as a governed data pipeline, not a simple transcription feature.
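To make the difference concrete, here is a minimal sketch of an indexed "moment" that carries more than the transcript. The `Moment` fields and the sample data are hypothetical, and the plain substring match is a stand-in for whatever retrieval a real platform performs. It only shows why a transcript-only index misses content that lives on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Moment:
    """One indexed point in a video (hypothetical schema)."""
    video_id: str
    start_seconds: float
    transcript: str      # what was said
    on_screen_text: str  # slide or OCR text, assumed extracted upstream
    visual_caption: str  # frame description, assumed produced by a vision model


def search(moments: list[Moment], term: str, transcript_only: bool = False) -> list[Moment]:
    """Return moments whose indexed fields contain the term (case-insensitive)."""
    needle = term.lower()
    hits = []
    for moment in moments:
        fields = [moment.transcript]
        if not transcript_only:
            fields += [moment.on_screen_text, moment.visual_caption]
        if any(needle in text.lower() for text in fields):
            hits.append(moment)
    return hits


MOMENTS = [
    Moment("demo-42", 0.0, "Welcome everyone, let's get started.", "", "Title slide"),
    Moment(
        "demo-42",
        754.0,
        "As you can see here, the number dropped after the pricing change.",
        "Q3 churn: 4.2%",
        "Presenter pointing at a bar chart",
    ),
]

print(len(search(MOMENTS, "churn", transcript_only=True)))  # 0
print(search(MOMENTS, "churn")[0].start_seconds)            # 754.0
```

The speaker never says "churn", so transcript search finds nothing, while the multimodal index returns the exact moment. The same property is what widens the exposure: slide content becomes retrievable by anyone who can query the index.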

Why video search becomes an infrastructure problem

The article frames video understanding as an infrastructure challenge because the files are large, the inference stack is multi-model, and latency matters if people are expected to use the system daily. A one-hour 1080p recording can exceed a gigabyte, so scaling requires orchestration across compute, storage, and model stages. The technical issue is not only accuracy, but throughput, cost, and reliability under enterprise workloads. If the platform cannot process content fast enough, it never becomes part of operational work.

Practical implication: evaluate performance, cost, and workflow fit together before making video search a production capability.
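A quick back-of-the-envelope calculation shows where the gigabyte figure comes from. The 5 Mbps bitrate and the 10,000-hour library below are assumptions for illustration; actual bitrates vary widely with codec and recording settings.

```python
def recording_size_gb(bitrate_mbps: float, duration_seconds: float) -> float:
    """Approximate file size in decimal gigabytes for a constant bitrate."""
    megabits = bitrate_mbps * duration_seconds
    megabytes = megabits / 8  # 8 bits per byte
    return megabytes / 1000   # decimal GB


ONE_HOUR_SECONDS = 3600
ASSUMED_BITRATE_MBPS = 5.0  # assumption, not a figure from the interview

one_hour_gb = recording_size_gb(ASSUMED_BITRATE_MBPS, ONE_HOUR_SECONDS)
print(one_hour_gb)  # 2.25

# Scaling the same assumption to a hypothetical 10,000-hour archive.
library_tb = one_hour_gb * 10_000 / 1000
print(library_tb)  # 22.5
```

Under this assumption, any recording above roughly 2.2 Mbps passes one gigabyte per hour, and an archive of a few thousand hours reaches tens of terabytes before any index or derived artefact is stored.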

Access control and audit logging for searchable media

Once video becomes searchable and quotable, it behaves like structured enterprise knowledge rather than an inert file. That means access control must determine who can query, extract, and share content, while audit logging must show what was accessed and when. The article also points to data residency as part of the enterprise requirement set, which matters when meetings or training content crosses regions or business units. In practice, the security model has to govern derived outputs as well as the original recording.

Practical implication: align permissions, logging, and residency rules to the searchable output, not just the source media.
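One way to picture this is a single authorisation function that distinguishes query, extract, and share, and records every decision. This is a minimal sketch: the role names, the `ROLE_ACTIONS` mapping, and the event fields are hypothetical and do not represent any vendor's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Action(Enum):
    QUERY = "query"
    EXTRACT = "extract"
    SHARE = "share"


# Hypothetical role-to-action mapping; least privilege by default.
ROLE_ACTIONS = {
    "viewer": {Action.QUERY},
    "editor": {Action.QUERY, Action.EXTRACT},
    "publisher": {Action.QUERY, Action.EXTRACT, Action.SHARE},
}


@dataclass
class AuditLog:
    """In-memory stand-in for a real audit sink."""
    events: list = field(default_factory=list)

    def record(self, user: str, action: Action, video_id: str, allowed: bool) -> None:
        self.events.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action.value,
            "video_id": video_id,
            "allowed": allowed,
        })


def authorize(user: str, role: str, action: Action, video_id: str, log: AuditLog) -> bool:
    """Decide whether the action is permitted and log the decision either way."""
    allowed = action in ROLE_ACTIONS.get(role, set())
    log.record(user, action, video_id, allowed)
    return allowed


log = AuditLog()
print(authorize("ana", "viewer", Action.QUERY, "all-hands-07", log))  # True
print(authorize("ana", "viewer", Action.SHARE, "all-hands-07", log))  # False
print(len(log.events))                                                # 2
```

Denied attempts are logged as well as permitted ones, since an attempt to share a clip without entitlement is itself evidence worth keeping.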




NHI Mgmt Group analysis

Searchable video turns unstructured media into governed enterprise knowledge. That changes the identity problem because access is no longer limited to viewing a recording. Users can query, extract, and redistribute moments from meetings, demos, and calls, which means the control plane must govern both the original object and its derived fragments. The practitioner conclusion is that media search is an access-governance problem as much as it is an AI problem.

Video understanding exposes a metadata governance gap that many enterprise programmes have not modelled. Traditional controls focus on files, folders, and applications, but multimodal systems create new searchable artefacts, embeddings, and quotable outputs. Those derivatives can outlive the original context that made them safe to view in the first place. The implication is that classification, retention, and entitlement models need to account for extracted knowledge, not just stored content.
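One possible way to close that gap is to have every derived artefact inherit the classification and retention date of its source at creation time. The sketch below assumes hypothetical `Recording` and `DerivedArtefact` records; it is a modelling illustration rather than a description of how any platform stores its index.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Recording:
    video_id: str
    sensitivity: str    # e.g. "internal", "confidential"
    delete_after: date  # retention deadline for the source


@dataclass(frozen=True)
class DerivedArtefact:
    artefact_id: str
    kind: str           # e.g. "clip", "quote", "embedding"
    source_id: str
    sensitivity: str
    delete_after: date


def derive(source: Recording, artefact_id: str, kind: str) -> DerivedArtefact:
    """Create a derivative that carries the source's label and retention date."""
    return DerivedArtefact(
        artefact_id=artefact_id,
        kind=kind,
        source_id=source.video_id,
        sensitivity=source.sensitivity,
        delete_after=source.delete_after,
    )


def expired(artefacts: list[DerivedArtefact], today: date) -> list[DerivedArtefact]:
    """Return derivatives whose retention deadline has passed."""
    return [a for a in artefacts if a.delete_after <= today]


board_call = Recording("board-2026-03", "confidential", date(2027, 3, 31))
quote = derive(board_call, "quote-001", "quote")
print(quote.sensitivity)                       # confidential
print(expired([quote], date(2027, 4, 1)))      # the quote is due for deletion
```

Keeping `source_id` on each derivative also lets an entitlement review trace a quote or embedding back to the recording that produced it.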

Access controls, data residency, and audit logging become the minimum enterprise bar for AI-native media systems. The article correctly frames those requirements as adoption constraints rather than optional hardening. Without them, organisations may succeed at indexing video while failing at governing who can surface sensitive material across teams, regions, or workflows. Practitioners should treat the video pipeline as part of the identity perimeter.

AI-native video platforms will pressure IAM teams to expand governance beyond documents and code. Meeting recordings and training libraries now contain institutional knowledge that can be searched as easily as a knowledge base article. That creates a new class of enterprise data exposure where access decisions are made against rich media, not just text. The practitioner takeaway is to test whether current governance processes can handle media-derived content at scale.

From our research:

  • The average estimated time to remediate a leaked secret is 27 days, despite 75% of organisations expressing strong confidence in their secrets management capabilities, according to The State of Secrets in AppSec.
  • Organisations maintain an average of 6 distinct secrets manager instances, creating fragmentation that undermines centralised control.
  • The management challenge extends beyond storage into lifecycle control, as shown in Ultimate Guide to NHIs - The NHI Market, which maps the tooling landscape that identity teams must rationalise.

What this signals

AI-native video search will widen the gap between content ownership and content governance. Teams may believe they are simply improving discoverability, but they are actually creating a higher-value data plane that needs entitlement boundaries, auditability, and retention discipline. The practical signal is to review whether meeting recordings, sales calls, and training assets are already being treated as searchable knowledge outside the IAM model.

The strongest programmes will connect media governance to identity governance rather than running them as separate controls. That means integrating classification, access review, and logging into the same operational rhythm used for other sensitive enterprise systems. Where that does not happen, searchable video becomes an uncontrolled distribution layer for institutional knowledge.


For practitioners

  • Classify video content before indexing it: Apply sensitivity labels to recordings, meeting archives, and training libraries before they enter any multimodal search workflow. If the system can surface quotes, slides, or screenshots, the classification must cover those outputs as well as the source file.
  • Scope access to searchable outputs, not just source files: Review who can query, export, and share extracted moments from video in addition to who can open the original recording. Use least privilege to separate viewers, editors, and users who can push content into downstream workflows.
  • Instrument audit trails around retrieval and reuse: Log the search queries, retrieved clips, shared snippets, and workflow destinations associated with video understanding systems. That gives security teams evidence when a sensitive meeting segment is reused outside its original business context.
  • Validate data residency before enterprise rollout: Confirm where media is stored, processed, and indexed, especially when recordings cross region or subsidiary boundaries. If the platform uses external inference services, map those flows to your contractual and regulatory obligations before deployment.
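The residency check in the last item can be sketched as a comparison between where each pipeline stage runs and the regions a recording is allowed to touch. The stage names and region identifiers here are hypothetical placeholders, not values from any real deployment.

```python
# Hypothetical set of regions approved for this content.
ALLOWED_REGIONS = {"eu-west", "eu-central"}


def residency_violations(pipeline_regions: dict[str, str], allowed: set[str]) -> dict[str, str]:
    """Return the stages that run outside the approved regions."""
    return {
        stage: region
        for stage, region in pipeline_regions.items()
        if region not in allowed
    }


# Where each stage of an assumed video pipeline actually runs.
pipeline = {
    "storage": "eu-west",
    "inference": "us-east",  # e.g. an external inference service
    "index": "eu-central",
}

violations = residency_violations(pipeline, ALLOWED_REGIONS)
print(violations)  # {'inference': 'us-east'}
```

Here the file and the index stay in region, but inference does not, which is the kind of flow that is easy to miss when only storage location is reviewed.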

Key takeaways

  • AI-native video understanding changes the control problem from storing recordings to governing searchable knowledge.
  • The enterprise risk is not transcription alone, but the creation of quotable, reusable media-derived outputs that need access and audit control.
  • IAM teams should test whether classification, entitlement review, and logging already cover video search before rollout expands.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
NIST CSF 2.0 | PR.AC-4 | Searchable video expands access scope and requires tighter entitlement management.
NIST Zero Trust (SP 800-207) | - | Media search needs continuous verification around who can reach sensitive outputs.
NIST CSF 2.0 | DE.CM-8 | Audit logging is central when video queries and shared snippets become governed outputs.

Apply zero trust principles to media retrieval and limit access by context, purpose, and entitlement.


Key terms

  • Multimodal Video Understanding: Multimodal video understanding is the process of analysing video by combining speech, visual frames, on-screen text, and contextual signals. Unlike transcription alone, it preserves more of the meaning needed for search, retrieval, and enterprise workflows, which also means governance must cover richer derived outputs.
  • Searchable Knowledge: Searchable knowledge is information that can be queried, extracted, and reused inside operational workflows rather than merely stored. In a video context, it includes clips, quotes, slides, and context fragments, so identity and access controls must govern the outputs as carefully as the source content.
  • Data Residency: Data residency is the requirement to store and process information within approved geographic or contractual boundaries. For AI-native media platforms, residency must apply to the video file, the derived index, and any processed outputs, because sensitive content can cross boundaries during inference and search.

Deepen your knowledge

Video governance and entitlement design are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If your environment is beginning to treat media as queryable enterprise knowledge, it is worth exploring.

This post draws on content published by WorkOS: Mazy Dar on building the future of video understanding at Here. Read the original.
