Open Lakehouse

An open lakehouse is a data architecture that combines the flexibility of data lakes with warehouse-like structure, performance and governance expectations. It is valuable because it supports broad analytics and AI use cases, but it also demands tight traceability so governance does not fall behind platform change.

Expanded Definition

An open lakehouse is a data architecture that merges lake flexibility with warehouse-like governance, schema discipline, and performance, while favoring open table formats and interoperable tooling. In NHI and agentic AI environments, the term matters because data access, lineage, and policy enforcement often cross teams, engines, and automation workflows.

Definitions vary across vendors, especially around how much “openness” must exist at the storage layer versus the query and governance layers. Practically, an open lakehouse is not just about where data sits, but whether access paths, metadata, and policy controls remain traceable as platforms evolve. That aligns closely with the governance intent of the NIST Cybersecurity Framework 2.0, particularly where asset visibility and access control depend on consistent records rather than ad hoc platform features.

When NHI-driven pipelines write to or read from an open lakehouse, the security question is not only “can the job run?” but “can the identity, permission, and data movement be audited end to end?” The most common misapplication is treating “open lakehouse” as a vendor-neutral label for any lake platform, which happens when teams overlook the governance gap between open storage and controlled operational access.

Examples and Use Cases

Implementing an open lakehouse rigorously often introduces metadata and policy overhead, requiring organisations to weigh cross-platform flexibility against the cost of maintaining consistent governance.

  • A data engineering team uses an open table format so batch jobs, BI queries, and AI feature pipelines all read the same governed dataset without duplicating permissions.
  • An automation agent writes telemetry into a lakehouse while an access policy layer records which NHI performed the write, when it occurred, and which downstream jobs can reuse it.
  • A security team reviews lineage for a sensitive dataset to confirm that API keys used by scheduled transformations are tracked and rotated, not buried inside pipeline definitions, as highlighted in the Ultimate Guide to NHIs.
  • An analytics platform separates raw landing zones from curated tables so machine learning workloads can scale without bypassing governance, while still aligning with the access discipline described by the NIST Cybersecurity Framework 2.0.
  • An incident response team traces an anomalous service account back to a specific table write path, showing why open formats are useful only when identities and permissions remain observable.
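
The audit-oriented use cases above (recording which NHI wrote to a table, when, and which downstream jobs may reuse the data, then tracing an identity back to its write paths) can be sketched in a few lines of Python. This is a minimal, in-memory illustration only: the `WriteEvent` and `AuditLog` names, their fields, and the sample identities and tables are all hypothetical and do not correspond to any specific lakehouse product or policy engine.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class WriteEvent:
    """One audited write. All field names are hypothetical."""
    identity: str               # NHI (service account, agent) that wrote
    table: str                  # target lakehouse table
    written_at: datetime        # when the write occurred
    allowed_readers: frozenset  # downstream jobs permitted to reuse the data


class AuditLog:
    """In-memory stand-in for the audit store of an access policy layer."""

    def __init__(self):
        self._events = []

    def record_write(self, identity, table, allowed_readers, written_at=None):
        event = WriteEvent(
            identity=identity,
            table=table,
            written_at=written_at or datetime.now(timezone.utc),
            allowed_readers=frozenset(allowed_readers),
        )
        self._events.append(event)
        return event

    def writes_by(self, identity):
        """Trace an identity back to its recorded writes, oldest first."""
        return sorted(
            (e for e in self._events if e.identity == identity),
            key=lambda e: e.written_at,
        )

    def may_reuse(self, job, table):
        """True if any recorded write to `table` lists `job` as a reader."""
        return any(
            e.table == table and job in e.allowed_readers
            for e in self._events
        )


# Hypothetical sample data.
log = AuditLog()
log.record_write("svc-telemetry-agent", "raw.telemetry", {"job-feature-build"})
log.record_write("svc-etl-nightly", "curated.sales", {"job-bi-refresh"})

traced = [e.table for e in log.writes_by("svc-telemetry-agent")]
```

In a real deployment the events would come from the platform's own audit logs rather than application code, but the shape of the question stays the same: given an identity, which table writes can be attributed to it?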

Why It Matters in NHI Security

Open lakehouses become security-critical when NHIs, agents, and automated data jobs share the same storage and compute fabric. If service accounts are over-privileged or poorly inventoried, the lakehouse can turn into a high-speed path for data exposure rather than a governed analytics layer. NHIMG reports that 97% of NHIs carry excessive privileges, and that only 5.7% of organisations have full visibility into their service accounts, a combination that is especially dangerous in architectures where many engines touch the same data plane.

That is why open lakehouse governance must include identity lifecycle controls, traceable permissions, and reviewable access paths. The architectural promise of openness can fail when engineers prioritise portability but do not maintain evidence of who or what accessed sensitive tables, which secret was used, or whether the workload still needs that access. The Ultimate Guide to NHIs frames this as a visibility and lifecycle problem, not just a storage problem. Organisations typically recognise the need for open lakehouse governance only after a pipeline has been compromised, at which point it can no longer be deferred.
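
One way to check "whether the workload still needs that access" is to compare the permissions each NHI has been granted with the permissions it has actually been observed using. The sketch below assumes both sets are already available as simple mappings. The `unused_grants` function, the `(table, action)` permission shape, and the sample identities are hypothetical, and a real review would also need to consider the observation window and infrequent jobs.

```python
def unused_grants(granted, observed):
    """Return, per identity, permissions granted but never observed in use.

    Both arguments map an identity name to a set of (table, action) tuples.
    """
    report = {}
    for identity, perms in granted.items():
        extra = set(perms) - set(observed.get(identity, ()))
        if extra:
            report[identity] = sorted(extra)
    return report


# Hypothetical sample data.
granted = {
    "svc-etl-nightly": {
        ("curated.sales", "read"),
        ("curated.sales", "write"),
        ("raw.hr", "read"),
    },
    "svc-bi-refresh": {("curated.sales", "read")},
}
observed = {
    "svc-etl-nightly": {("curated.sales", "read"), ("curated.sales", "write")},
    "svc-bi-refresh": {("curated.sales", "read")},
}

excess = unused_grants(granted, observed)
```

Here `svc-etl-nightly` holds a read grant on `raw.hr` that it has not been observed using, which makes that grant a candidate for review or removal.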

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
OWASP Non-Human Identity Top 10 | NHI-01 | Open lakehouse access depends on governing NHIs and service accounts across data paths.
NIST CSF 2.0 | PR.AC-4 (CSF 1.1 numbering; PR.AA-05 in CSF 2.0) | Open lakehouse governance relies on managing access permissions for machine identities.
NIST Zero Trust (SP 800-207) | AC-6 (least-privilege control from NIST SP 800-53) | Zero Trust requires explicit, least-privilege access for every automated lakehouse workload.

Inventory and constrain every NHI that can read or write lakehouse data, then verify least privilege continuously.
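
As a rough illustration of the inventory step, the following sketch flags identities that appear in lakehouse access logs but are missing from the NHI inventory, along with inventoried identities that have no recorded owner. The `inventory_gaps` function, the `owner` field, and the sample data are hypothetical, and the sketch assumes access logs have already been reduced to a list of identity names.

```python
def inventory_gaps(inventory, seen_identities):
    """Compare an NHI inventory with identities seen touching lakehouse data.

    `inventory` maps identity name -> metadata dict (hypothetical 'owner' key).
    `seen_identities` is an iterable of identity names from access logs.
    """
    unknown = sorted(set(seen_identities) - set(inventory))
    unowned = sorted(
        name for name, meta in inventory.items() if not meta.get("owner")
    )
    return {"unknown": unknown, "unowned": unowned}


# Hypothetical sample data.
inventory = {
    "svc-etl-nightly": {"owner": "data-eng"},
    "svc-legacy-export": {"owner": None},
}
seen = ["svc-etl-nightly", "svc-shadow-job", "svc-etl-nightly"]

gaps = inventory_gaps(inventory, seen)
```

Identities in the `unknown` list are reading or writing data without being inventoried at all, while those in `unowned` have no accountable team to approve, rotate, or retire their access.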