By NHI Mgmt Group Editorial Team
Published 2025-06-25
Domain: Governance & Risk
Source: Orca Security

TL;DR: OCR-based DSPM now extends sensitive-data detection into image files stored in cloud buckets, helping organisations find passport photos, ID cards, and other PII that traditional text-focused discovery can miss, according to Orca Security. The governance issue is broader than storage hygiene: identity proofs are now a data-classification and access-control problem, not just an application upload problem.


At a glance

What this is: An analysis of Orca Security's OCR-enhanced DSPM for finding sensitive data in image files stored across cloud environments.

Why it matters: Image-based PII changes how teams scope discovery, classification, and access controls across cloud, IAM, and compliance programmes.

👉 Read Orca Security's analysis of OCR for image-file sensitive data detection


Context

Image-file sensitive-data discovery is the problem here, not just cloud storage hygiene. When passport photos, driver’s licences, and similar identity documents land in buckets or file stores, teams inherit a classification, retention, and access-control obligation that text-only scanning often misses. That creates a direct identity governance issue because the same file can simultaneously contain personal identity evidence, regulated data, and operational credentials.

Orca Security's blog frames OCR as a way to extend DSPM into image files so organisations can identify PII, review redacted samples, and assess associated storage risks across cloud assets. For IAM, IGA, and cloud security teams, the larger point is that identity documents are now part of the sensitive-data inventory and should be governed as such, especially where cloud-native applications and regulated onboarding flows overlap.


Key questions

Q: How should teams govern identity documents stored in cloud buckets?

A: Treat identity documents as regulated sensitive data, not as incidental uploads. Assign ownership, classify them at ingestion, restrict access to the minimum operational set, and tie deletion or archival to the business purpose that created them. If the files can contain PII, the storage location and retention policy need the same level of governance as any other controlled record.

Q: Why do image files create blind spots in sensitive-data discovery?

A: Traditional discovery often focuses on text-based repositories, so image files can escape detection even when they contain passports, licences, or other identity proofs. OCR closes that gap by making the content searchable and classifiable. Without it, organisations make risk decisions from incomplete inventories and can miss regulated data sitting in otherwise ordinary storage.

Q: What should security teams do when sensitive data is found in unstructured files?

A: Validate the data type, confirm the storage location is expected, and determine whether the issue is permissions, retention, or an unsafe workflow. Use redacted samples to support triage without exposing more content than necessary. Then remove unnecessary access and align the file’s handling with the policy that governs the underlying data class.

Q: How do cloud teams reduce exposure from uploaded identity documents?

A: Minimise who can reach the files, keep them out of broad shared locations, and apply retention limits as soon as the identity verification purpose is complete. Discovery should cover buckets, exports, and backups so the documents do not persist in unmanaged copies. The goal is to keep the storage model aligned to the document’s regulatory and operational value.
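
The retention point above can be illustrated with a minimal sketch. It assumes a hypothetical inventory in which each stored copy of an identity document records where it lives and when the verification purpose was completed. The `StoredDocument` type, the field names, and the 30-day default are illustrative assumptions, not anything described by Orca Security.

```python
from datetime import date, timedelta
from typing import Iterable, List, NamedTuple, Optional


class StoredDocument(NamedTuple):
    """Hypothetical inventory entry for one stored copy of an identity document."""
    key: str                                # object key or file path
    location: str                           # "bucket", "export", or "backup"
    verification_completed: Optional[date]  # None while verification is still open


def overdue_for_deletion(
    docs: Iterable[StoredDocument], today: date, retention_days: int = 30
) -> List[StoredDocument]:
    """Return copies whose verification finished more than retention_days ago."""
    cutoff = today - timedelta(days=retention_days)
    return [
        d for d in docs
        if d.verification_completed is not None and d.verification_completed < cutoff
    ]


# Sample inventory: the same passport appears in a bucket and in a backup.
docs = [
    StoredDocument("uploads/passport-001.jpg", "bucket", date(2025, 1, 10)),
    StoredDocument("backups/passport-001.jpg", "backup", date(2025, 1, 10)),
    StoredDocument("uploads/licence-002.png", "bucket", date(2025, 6, 20)),
    StoredDocument("uploads/licence-003.png", "bucket", None),
]
for doc in overdue_for_deletion(docs, today=date(2025, 6, 25)):
    print(doc.location, doc.key)
```

Tracking the backup copy as its own entry is deliberate: an unmanaged copy outlives the original unless the inventory lists it separately.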


Technical breakdown

How OCR extends DSPM into image files

Optical character recognition turns unstructured images into searchable text signals that DSPM engines can classify. In practice, that means a system can inspect a passport photo or scanned ID card, extract visible identifiers, and map them to sensitive-data classes such as PII. The important distinction is that OCR does not secure the file by itself. It improves discovery and evidence generation, but the control value comes from what follows: classification, policy enforcement, and remediation workflows tied to the storage location and data type.

Practical implication: add image-file scanning to discovery workflows wherever user-uploaded identity documents can enter cloud storage.
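
As a rough illustration of the OCR-then-classify flow described above, the sketch below takes text that an OCR engine has already extracted and maps it to sensitive-data classes with regular expressions. The pattern set, the class names, and the `ocr` callable are assumptions made for the example. They are not Orca Security's detectors, and production classifiers would need locale-specific rules and validation.

```python
import re
from typing import Callable, Dict, List

# Illustrative patterns only (assumed for this sketch, not a vendor rule set).
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    # First line of a passport machine-readable zone, e.g. "P<UTOERIKSSON<<ANNA..."
    "passport_mrz": re.compile(r"P<[A-Z]{3}[A-Z<]{10,39}"),
    # A labelled date of birth such as "DOB: 12/08/1974"
    "date_of_birth": re.compile(
        r"\b(?:DOB|Date of Birth)[:\s]+\d{2}[/-]\d{2}[/-]\d{4}\b", re.IGNORECASE
    ),
    # US Social Security number layout
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify_text(text: str) -> List[str]:
    """Return the sorted names of every class whose pattern matches the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


def classify_image(image_bytes: bytes, ocr: Callable[[bytes], str]) -> List[str]:
    """Run the supplied OCR callable, then classify the extracted text."""
    return classify_text(ocr(image_bytes))


def stub_ocr(_: bytes) -> str:
    """Stand-in for a real OCR engine; returns fixed text for the demo."""
    return "P<UTOERIKSSON<<ANNA<MARIA<<<<<<<<<<<<<<<<<<<\nDOB: 12/08/1974"


print(classify_image(b"", stub_ocr))
```

Passing the OCR engine in as a callable keeps the classification logic testable without image data, which matches the point that OCR supplies the text signal while classification and policy do the governing.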

Why cloud buckets become identity-data repositories

Cloud object storage often becomes an incidental repository for identity proofs because applications ingest uploads faster than governance catches up. Once stored, those files may sit alongside logs, backups, and application exports, expanding the exposed data surface. The security issue is not only where the data lives but whether the organisation can prove what is there, who can reach it, and how long it should remain. That is why bucket visibility, object-level classification, and retention controls need to be treated as part of identity-data governance rather than isolated storage management.

Practical implication: treat buckets holding identity documents as governed data stores with ownership, retention, and access review requirements.
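
One way to make "ownership, retention, and access review requirements" checkable is to record them per bucket and flag what is missing. The sketch below assumes a hypothetical `BucketRecord` kept in an internal register and an assumed 90-day review interval. Neither comes from the source.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class BucketRecord:
    """Hypothetical register entry for a bucket that holds identity documents."""
    name: str
    owner: Optional[str]
    retention_days: Optional[int]
    last_access_review: Optional[date]


def governance_gaps(
    bucket: BucketRecord, today: date, review_interval_days: int = 90
) -> List[str]:
    """List the governance requirements this bucket does not currently meet."""
    gaps: List[str] = []
    if not bucket.owner:
        gaps.append("no owner assigned")
    if bucket.retention_days is None:
        gaps.append("no retention period defined")
    if bucket.last_access_review is None:
        gaps.append("access never reviewed")
    elif today - bucket.last_access_review > timedelta(days=review_interval_days):
        gaps.append("access review overdue")
    return gaps


uploads = BucketRecord("kyc-uploads", None, 30, date(2025, 1, 1))
print(governance_gaps(uploads, today=date(2025, 6, 25)))
```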

What redacted samples change in investigation workflows

Redacted samples give analysts proof that sensitive data was present without exposing the full content to more people than necessary. That matters because detection without evidence often creates ambiguity in triage, while full disclosure creates unnecessary handling risk. In a DSPM context, the sample supports prioritisation, exception handling, and remediation validation. The deeper value is operational: security teams can confirm the data type, determine whether the location is expected, and decide whether the exposure is a policy issue, a permissions issue, or a retention issue.

Practical implication: use redacted samples to accelerate triage while preserving least-privilege handling of discovered files.
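
A redacted sample can be as simple as a masked copy of the detected value. The helper below is a minimal sketch of that idea under assumed defaults (two visible trailing characters, `*` as the mask). It is not a description of how any DSPM product produces its samples.

```python
def redact(value: str, visible: int = 2, mask: str = "*") -> str:
    """Mask alphanumeric characters, leaving only the last `visible` ones readable.

    Separators such as "-" are kept so the shape of the identifier stays
    recognisable. Values no longer than `visible` are masked entirely.
    """
    total = sum(ch.isalnum() for ch in value)
    to_mask = total if total <= visible else total - visible
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


print(redact("123-45-6789", visible=4))
```

Keeping the separators lets an analyst confirm the data type from its layout while the identifying digits stay hidden.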


NHI Mgmt Group analysis

Identity documents are no longer just onboarding artefacts; they are governed data assets. Once government IDs move into cloud buckets, they become subject to discovery, access, retention, and revocation concerns that sit at the intersection of IAM, privacy, and data security. The field still treats many of these files as application by-products, but the control burden is the same as any other regulated identity record. Practitioners should treat uploaded identity proofs as part of the identity governance boundary, not as incidental attachments.

Image-file DSPM closes a visibility gap, but only after the organisation accepts that text-only discovery is incomplete. The main failure mode is not a lack of policy language but a lack of discovery across unstructured content. If passports, licences, and checks are stored in image form, a programme that only scans text repositories will systematically undercount exposure. The implication is that sensitive-data inventory must include binary and image formats, or the organisation will keep making risk decisions from partial evidence.

OCR makes PII classification more operational, not more discretionary. Once a system can identify sensitive data inside images, teams can no longer claim they lacked a practical way to find it. That shifts the governance question from whether the data exists to whether the organisation can classify it at scale and enforce handling rules consistently. For IAM and cloud teams, that means identity proof uploads should be mapped into the same control design as other regulated data classes.

Access control for image-based identity data needs to be tied to business purpose, not storage convenience. The presence of a passport image in a bucket does not justify broad access for developers, support teams, or downstream analytics. The more valuable the identity document, the more damaging broad object-store visibility becomes. Practitioners should narrow access to the minimum operational set and then validate that the storage model still reflects the document’s regulatory status.

Cloud-native applications are quietly turning identity verification into a long-lived data governance problem. The modern verification flow starts with a user upload and ends with files that may persist well beyond the original business need. That persistence creates compliance, breach, and retention exposure that is easy to miss because the collection event looked routine. Teams should assume every identity-document workflow has downstream governance consequences until proved otherwise.

From our research:

  • 80% of identity breaches involved compromised non-human identities such as service accounts and API keys, according to the Ultimate Guide to NHIs.
  • Only 5.7% of organisations have full visibility into their service accounts, which shows how often identity inventory remains incomplete.
  • See Ultimate Guide to NHIs, Lifecycle Processes for Managing NHIs, for the lifecycle controls that keep sensitive records and access paths aligned.

What this signals

Identity proofs now belong inside the same governance conversation as workload secrets and service accounts. Once uploads include passports or licences, the security programme has to manage both the file and the identity context around it. That creates a wider control boundary for IAM and DSPM teams, especially where cloud buckets, application logs, and downstream analytics all touch the same data.

Only 5.7% of organisations have full visibility into their service accounts, per the Ultimate Guide to NHIs, which is a useful reminder that visibility gaps are structural, not exceptional. The same pattern appears in identity-document storage when teams assume an upload path is automatically governed. Practitioners should expect hidden repositories, unmanaged copies, and inconsistent retention to surface once OCR expands discovery beyond text.

The practical signal for programmes is clear: if you cannot inventory the file, you cannot govern the identity record inside it. Teams that connect DSPM, IAM, and lifecycle controls will be better positioned to prove handling, reduce exposure, and defend retention decisions during audit or incident review.


For practitioners

  • Extend discovery to image formats: Add OCR-enabled scanning to any storage path that can hold passports, licences, voter cards, checks, or screenshots containing identity data. Validate that uploads in buckets, file stores, and application exports are included in the same classification workflow as text records.
  • Classify identity proofs as regulated sensitive data: Map uploaded identity documents into the same inventory as PII and other regulated records, with explicit ownership and retention rules. Make sure the policy covers redaction, exception handling, and deletion triggers after the original business purpose ends.
  • Review bucket access against business purpose: Limit access to cloud objects containing identity documents to the smallest set of operational roles that genuinely need them. Re-check whether developers, support staff, and analytics pipelines can reach these files without a documented business need.
  • Use redacted evidence in triage workflows: Require redacted samples for validation so analysts can confirm sensitive content without widening exposure. Tie the sample to remediation steps that remove unnecessary retention, correct permissions, or move the records into a governed repository.
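
The access review in the list above can be sketched as a set difference between who can reach a bucket and who has a documented business need. The role names and the two dictionaries are hypothetical inputs. In practice they would come from the cloud provider's policy data and from an internal approvals record.

```python
from typing import Dict, Set


def excess_access(
    granted: Dict[str, Set[str]], approved: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Return, per bucket, the principals with access but no documented need."""
    findings: Dict[str, Set[str]] = {}
    for bucket, principals in granted.items():
        extra = principals - approved.get(bucket, set())
        if extra:
            findings[bucket] = extra
    return findings


# Hypothetical inputs: effective access versus documented business need.
granted = {
    "kyc-uploads": {"role/verification-svc", "role/developer", "role/analytics"},
    "public-assets": {"role/developer"},
}
approved = {
    "kyc-uploads": {"role/verification-svc"},
    "public-assets": {"role/developer"},
}
for bucket, roles in excess_access(granted, approved).items():
    print(bucket, sorted(roles))
```

A bucket with no entry in `approved` is treated as having no approved principals, so every grant on it is reported. That default is a design choice that favours over-reporting.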

Key takeaways

  • OCR-enabled DSPM changes image files from blind spots into discoverable sensitive-data assets, which is essential when those files contain identity evidence.
  • The main governance risk is incomplete inventory, because cloud buckets can quietly become long-lived repositories for regulated identity documents.
  • Teams should tie discovery, access restriction, and retention to the business purpose of each uploaded identity file, not just to the storage platform.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
NIST CSF 2.0 | ID.AM-1 | Identity document discovery depends on knowing where sensitive assets live.
NIST CSF 2.0 | PR.DS-1 | Protecting data at rest applies to image files holding identity proofs.
NIST Zero Trust (SP 800-207) | - | Cloud bucket access to identity files should follow continuous verification and least privilege.

Inventory image-file repositories and map them into the organisation's sensitive-asset register.


Key terms

  • Data Security Posture Management: Data Security Posture Management is the practice of finding, classifying, and reducing risk around sensitive data across storage locations. In cloud environments it focuses on where data lives, who can reach it, and whether exposure, misconfiguration, or retention problems create avoidable risk.
  • Optical Character Recognition: Optical Character Recognition is the process of extracting readable text from images so software can search and classify the content. In security workflows, OCR helps discover sensitive data hidden inside scans, photos, and screenshots that normal text-based tools would miss.
  • Sensitive Data Inventory: A sensitive data inventory is a structured record of where regulated or high-risk information exists, what type it is, and who is responsible for it. It gives security and compliance teams the evidence needed to apply access controls, retention rules, and remediation consistently.
  • Redacted Sample: A redacted sample is a limited preview of detected sensitive content that confirms the finding without exposing the full data. It helps analysts validate classification and priority while reducing the chance that investigators themselves become an unnecessary exposure path.

Deepen your knowledge

NHI governance, agentic AI identity, and machine identity lifecycle are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are responsible for identity security strategy or NHI governance in your organisation, it is worth exploring.

This post draws on content published by Orca Security: OCR-based DSPM for sensitive data in image files. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-06-25.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org