Why do PII discovery tools struggle with unstructured data?

Why This Matters for Security Teams

PII discovery tools struggle with unstructured data because the data carries meaning in language and context, not in predictable fields. A name in an email signature, a medical record embedded in a PDF, or an address inside a chat thread may all be personally identifiable, but each requires different interpretation. That is why broad scan-and-classify approaches often miss high-risk content or flood teams with false positives.

This matters operationally because discovery is often treated as a technology purchase instead of a data governance workflow. Security and privacy teams need to know where personal data exists, who owns it, how it is used, and when it should be retained or deleted. NHI Management Group’s Ultimate Guide to NHIs — Key Research and Survey Results shows how visibility gaps create lasting exposure in identity and access management, and the same pattern appears in data discovery when content is scattered across collaboration tools, file shares, and archived systems. In practice, many security teams discover sensitive content only after a retention dispute, an eDiscovery request, or an incident response review has already started.

How It Works in Practice

Effective PII discovery in unstructured environments combines several methods because no single detector is reliable enough on its own. Pattern matching still helps with obvious items such as Social Security numbers or phone numbers, but unstructured content also needs contextual analysis, file inspection, and human review for edge cases. A layered process of this kind is generally more dependable than running one scanner across every repository.
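The layered idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patterns, the context terms, the 40-character window, and the "flag" versus "review" routing are all hypothetical choices made for the example, not rules taken from any particular tool.

```python
import re
from dataclasses import dataclass

# Hypothetical US-style patterns; real rules need locale-specific validation.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

# Hypothetical context terms that make a pattern hit more credible.
CONTEXT_TERMS = {
    "ssn": ("ssn", "social security"),
    "phone": ("phone", "call", "mobile", "tel"),
}


@dataclass
class Finding:
    kind: str
    value: str
    decision: str  # "flag" when context supports the hit, else "review"


def scan(text, window=40):
    """Layer 1: pattern match. Layer 2: nearby context. Layer 3: human review queue."""
    findings = []
    lowered = text.lower()
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            nearby = lowered[start:end]
            has_context = any(term in nearby for term in CONTEXT_TERMS[kind])
            decision = "flag" if has_context else "review"
            findings.append(Finding(kind, match.group(), decision))
    return findings


if __name__ == "__main__":
    for line in ("Employee SSN: 123-45-6789", "Order ref 987-65-4321 shipped"):
        print(line, "->", scan(line))
```

The second sample line matches the same pattern as the first but has no supporting context, so it lands in the review queue rather than being flagged outright.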

Teams usually improve results by pairing classification with business context. For example, a document in a legal share may warrant different handling than the same string in a test dataset. File type, location, ownership, access history, and sensitivity labels all influence whether content is truly PII. NHI Management Group’s NHI Lifecycle Management Guide is about identity lifecycle discipline, but the same operational principle applies here: discovery only becomes useful when it is tied to ownership, review, and remediation.
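As a rough sketch of pairing a detector hit with business context, the function below picks a handling route from file metadata. The `FileContext` fields, the path fragments, the "confidential" label, and the route names are all assumptions invented for this example; an actual program would take them from its own labelling scheme and ownership records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileContext:
    path: str
    owner: Optional[str]  # None when no owner is recorded
    label: Optional[str]  # sensitivity label, if one was applied


def handling_for(ctx: FileContext) -> str:
    """Choose a route for a file in which a detector already found a candidate identifier."""
    path = ctx.path.lower()
    # Sensitive locations and labels take priority over everything else.
    if ctx.label == "confidential" or "/legal/" in path:
        return "escalate"
    # The same string in a test dataset usually warrants less urgency.
    if "/test-data/" in path:
        return "low-priority review"
    # Without an owner, nobody can confirm or remediate the finding.
    if ctx.owner is None:
        return "assign owner first"
    return "standard review"


if __name__ == "__main__":
    print(handling_for(FileContext("/shares/legal/contract.docx", "jdoe", None)))
    print(handling_for(FileContext("/eng/test-data/fixtures.csv", "qa-team", None)))
```

The ordering of the checks is itself a policy decision: here a sensitivity label or legal location outranks the test-data shortcut, which is one defensible choice among several.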

Practitioners also need to account for modern content formats:

Emails and chat logs often contain indirect identifiers that only become sensitive when combined.

Documents and PDFs may hide PII in headers, footers, comments, or embedded objects.

Images and scans require OCR before text analysis can even begin.

Collaboration platforms can duplicate or fragment records across versions, threads, and exports.
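One way to account for these formats is to decide the preprocessing steps per file type before any text analysis runs. The sketch below is hypothetical throughout: the extension map and the step names are placeholders for whatever extraction, OCR, and deduplication tooling a team actually uses, and no specific library is implied.

```python
from pathlib import PurePath

# Hypothetical step names; each would map to a real extractor in practice.
DOCUMENT_STEPS = [
    "extract_body",
    "extract_headers_footers_comments",
    "extract_embedded_objects",
    "text_scan",
]
IMAGE_STEPS = ["ocr", "text_scan"]
MESSAGE_STEPS = ["extract_body", "group_by_thread", "thread_level_scan"]

PLAN = {
    ".pdf": DOCUMENT_STEPS,
    ".docx": DOCUMENT_STEPS,
    ".png": IMAGE_STEPS,
    ".jpg": IMAGE_STEPS,
    ".tiff": IMAGE_STEPS,
    ".eml": MESSAGE_STEPS,
}


def preprocessing_steps(filename):
    """Return the ordered steps for a file, or manual triage for unknown types."""
    suffix = PurePath(filename).suffix.lower()
    return list(PLAN.get(suffix, ["manual_triage"]))


if __name__ == "__main__":
    for name in ("scan_0042.PNG", "offer_letter.pdf", "export.bin"):
        print(name, "->", preprocessing_steps(name))
```

Scanning messages at thread level rather than one at a time reflects the point above that indirect identifiers may only become sensitive in combination.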

For program design, the NIST Cybersecurity Framework 2.0 supports the broader idea that visibility, governance, and risk management must work together rather than as isolated controls. Discovery controls tend to break down when repositories are highly dynamic, content is multilingual, or business teams store sensitive material in informal channels, because in each case the surrounding context is too variable for deterministic rules alone.

Common Variations and Edge Cases

Tighter discovery coverage often increases false positives, manual review, and operational overhead, so organisations must balance accuracy against the cost of investigating borderline content. That tradeoff is especially visible in unstructured repositories where one file can contain multiple data types, intended audiences, and retention obligations.
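The tradeoff can be made concrete by measuring it on a small hand-labelled sample. The sketch below assumes a detector that emits a confidence score per item; the scores, labels, and thresholds are invented for illustration and are not results from any real tool.

```python
def evaluate(scored, threshold):
    """Summarise review load and accuracy for one confidence threshold.

    `scored` is a list of (score, is_pii) pairs from a hand-labelled sample.
    """
    flagged = [(score, is_pii) for score, is_pii in scored if score >= threshold]
    true_positives = sum(1 for _, is_pii in flagged if is_pii)
    total_pii = sum(1 for _, is_pii in scored if is_pii)
    return {
        "flagged": len(flagged),  # items a reviewer must look at
        "false_positives": len(flagged) - true_positives,
        "precision": true_positives / len(flagged) if flagged else 0.0,
        "recall": true_positives / total_pii if total_pii else 0.0,
    }


if __name__ == "__main__":
    # Invented sample: (detector score, reviewer-confirmed PII?)
    sample = [
        (0.95, True), (0.80, True), (0.60, False),
        (0.55, True), (0.30, False), (0.20, False),
    ]
    for threshold in (0.5, 0.9):
        print(threshold, evaluate(sample, threshold))
```

On this invented sample, the lower threshold catches every confirmed item at the cost of one false positive and four items to review, while the higher threshold produces no false positives but misses two of the three confirmed items.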

Best practice for AI-assisted discovery is still evolving. Some tools now use machine learning or LLM-based classification to improve context recognition, but there is no universal standard for this yet, and results vary widely by content type and training data. Human validation therefore remains important for high-impact repositories such as HR records, legal archives, customer support transcripts, and executive communications.

Another common edge case is derived PII. A dataset may not contain direct identifiers, but it can still reveal a person when combined with reference data, file naming conventions, or metadata. Security teams should also avoid assuming that deletion is straightforward. Backups, synced copies, exports, and cached versions can keep sensitive material alive long after the source record is removed. NHI Management Group’s Top 10 NHI Issues highlights how visibility and lifecycle gaps create recurring exposure in identity programs, and the same governance gap explains why unstructured PII discovery often stalls in real environments.
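A simple way to spot derived PII risk is to count how many records share each combination of indirect attributes, in the spirit of a k-anonymity check. The column names, sample rows, and the choice of k below are assumptions made for the example; which attributes count as quasi-identifiers depends on the dataset and the reference data an attacker could plausibly hold.

```python
from collections import Counter


def risky_combinations(rows, quasi_identifiers, k=2):
    """Return attribute combinations shared by fewer than k rows.

    A combination held by very few rows can single out a person even
    though no direct identifier is present.
    """
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [combo for combo, n in counts.items() if n < k]


if __name__ == "__main__":
    # Invented rows with no names or ID numbers.
    rows = [
        {"zip": "02139", "birth_year": 1984, "dept": "Legal"},
        {"zip": "02139", "birth_year": 1984, "dept": "Legal"},
        {"zip": "94105", "birth_year": 1990, "dept": "HR"},
    ]
    print(risky_combinations(rows, ["zip", "birth_year", "dept"]))
```

Here the third row is the only one with its combination of values, so it is reported as risky even though the dataset contains no direct identifier.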

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST CSF 2.0 and NIST AI RMF set the governance and control requirements practitioners need to meet.

Framework	Control / Reference	Relevance
NIST CSF 2.0	GV.RM	Unstructured PII discovery is a risk management and governance problem.
NIST CSF 2.0	ID.AM	Discovery depends on knowing where sensitive content resides.
NIST AI RMF	MAP 1.1	Context-aware classification is an AI risk mapping use case.

Tie discovery outputs to risk owners, retention decisions, and remediation workflows.

