
How should organisations classify data that may become PII when combined with other records?

They should classify it by re-identification potential, not by the label on the source field. Data such as device IDs, location data, or behavioural logs may be non-identifying alone but become PII when joined with other records. Governance should therefore assess combinations, downstream uses, and system-to-system linkage before deciding access and retention rules.

Why This Matters for Security Teams

Data classification breaks down when teams treat field labels as the whole story. A device identifier, location trail, or behavioural log may look harmless in isolation, yet become personal data once it is combined with other records. That is why classification has to follow re-identification potential, downstream use, and linkability, not just source-system semantics. The NIST Cybersecurity Framework 2.0 supports this broader risk-based view of information handling.

This matters because access control, retention, sharing, and logging rules all depend on the classification decision. If data is under-classified, teams may expose personal information through analytics, APIs, or system joins that were never reviewed as a privacy risk. If it is over-classified, organisations often create unnecessary friction and weaken data usability. NHI Mgmt Group’s Ultimate Guide to NHIs — Key Research and Survey Results shows how often identity-related data is already exposed in weak control environments, which is a reminder that classification must account for operational reality, not just policy intent. In practice, many security teams discover re-identification exposure only after analytics pipelines or data-sharing workflows have already multiplied the original risk.

How It Works in Practice

Effective classification starts by asking whether the data can identify a person alone, whether it can do so in combination, and whether a recipient could reasonably link it to other datasets. Current guidance suggests treating re-identification as a function of context, not a fixed property of the source field. That means a record may be non-PII in one system and effectively personal data in another once it is joined with account metadata, timestamps, geolocation, or customer support history.
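The effect of a routine join can be shown with a minimal Python sketch. All records, field names, and the `link` helper below are hypothetical and exist only for illustration; the point is that telemetry with no direct identifier becomes tied to a customer as soon as it shares a key with account metadata.

```python
# Hypothetical records: every field name and value here is illustrative.
telemetry = [
    {"device_id": "d-101", "hour": 8, "cell": "grid-17"},
    {"device_id": "d-102", "hour": 9, "cell": "grid-04"},
]
accounts = [
    {"device_id": "d-101", "customer": "Customer A"},
]


def link(events, lookup, key):
    """Attach lookup fields to every event that shares the join key."""
    index = {row[key]: row for row in lookup}
    return [
        {**event, **index[event[key]]}
        for event in events
        if event[key] in index
    ]


# The first telemetry row now carries a customer reference;
# the second stays unlinked because no account shares its device ID.
linked = link(telemetry, accounts, "device_id")
```

In this sketch the telemetry itself never changes. Only the receiving environment does, which is why the same record can warrant different classifications in different systems.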

Security and privacy teams should align on a few operational steps:

  • Map data elements to likely linkage sources, such as customer master data, device inventories, and event logs.
  • Classify based on the strongest plausible combined identifier, not the weakest standalone field.
  • Apply purpose limitation to analytics, sharing, and model training so downstream use does not silently change the classification.
  • Reassess classification when new datasets, APIs, or enrichment services are introduced.
  • Set access, retention, and masking rules using the highest credible re-identification risk.
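The steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `Sensitivity` levels, the `STANDALONE` labels, the `LINKAGE_SOURCES` map, and the dataset names are all hypothetical, and a real programme would draw them from its own data inventory.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    NON_PERSONAL = 0
    LINKABLE = 1
    PERSONAL = 2


# Hypothetical standalone labels, as a source system might assign them.
STANDALONE = {
    "email": Sensitivity.PERSONAL,
    "device_id": Sensitivity.LINKABLE,
    "hour": Sensitivity.NON_PERSONAL,
}

# Hypothetical linkage map: field -> datasets that resolve it to a person.
LINKAGE_SOURCES = {
    "device_id": {"customer_master", "device_inventory"},
}


def classify(fields, reachable_datasets):
    """Return the highest credible sensitivity for a set of fields.

    Unknown fields default to LINKABLE as a conservative assumption.
    A field is raised to PERSONAL when a dataset that can resolve it
    is reachable from the environment holding the data.
    """
    reachable = set(reachable_datasets)
    level = max(
        (STANDALONE.get(f, Sensitivity.LINKABLE) for f in fields),
        default=Sensitivity.NON_PERSONAL,
    )
    for f in fields:
        if LINKAGE_SOURCES.get(f, set()) & reachable:
            level = max(level, Sensitivity.PERSONAL)
    return level


isolated = classify(["device_id", "hour"], [])
joined = classify(["device_id", "hour"], ["customer_master"])
```

Here `isolated` evaluates to `Sensitivity.LINKABLE` and `joined` to `Sensitivity.PERSONAL`. Re-running `classify` whenever the list of reachable datasets changes is one way to model the reassessment step.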

For teams building a control baseline, the NIST CSF guidance helps translate this into governance, while privacy engineering practices increasingly borrow from data-minimisation and contextual access models. Where non-human identities are involved, the same logic applies to service accounts and automation pipelines because they often move data across trust boundaries faster than human reviewers can track. This is especially important for systems that ingest telemetry, identity graphs, or behavioural analytics, because re-identification risk can emerge through routine joins rather than deliberate misuse. These controls tend to break down when data is copied into loosely governed data lakes or BI tools because linkage paths multiply faster than classification reviews can keep up.

Common Variations and Edge Cases

Tighter classification often increases operational overhead, requiring organisations to balance privacy protection against analytics speed, reporting accuracy, and storage cost. That tradeoff is real, especially where teams need to compare datasets across business units or preserve long-lived research records.

There is no universal standard for every edge case, but a few patterns recur. Pseudonymised data is not automatically non-PII if the re-linking key exists elsewhere. Aggregated data may still be sensitive if the cohort is small or if repeated queries make individual inference possible. Behavioural telemetry can appear anonymous until it is combined with account login events or device fingerprints. In those cases, current guidance suggests classifying the dataset by the risk of identification in the receiving environment, not by the formatting of the source record.
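The small-cohort case can be checked with a simple group-size test, in the spirit of k-anonymity. The source does not prescribe this technique or any threshold; the `small_cohorts` helper, the field names, and the default `k=5` are assumptions chosen only for illustration.

```python
from collections import Counter


def small_cohorts(rows, quasi_identifiers, k=5):
    """Return each quasi-identifier combination shared by fewer than k rows."""
    counts = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical aggregated telemetry: five rows in one cell and hour, one in another.
rows = [{"cell": "grid-17", "hour": 8} for _ in range(5)]
rows.append({"cell": "grid-04", "hour": 9})

flagged = small_cohorts(rows, ["cell", "hour"])
```

In this sketch `flagged` contains only the `("grid-04", 9)` combination, with a count of 1. A check like this does not address inference from repeated queries, which needs separate controls.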

For higher-risk programs, teams should document assumptions about linkage resistance, review whether access belongs behind role-based controls or more restrictive purpose-based controls, and revisit the classification whenever new enrichment or sharing partners are added. That is the most practical way to keep privacy controls aligned with real-world data movement, rather than with the original system design.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

NIST CSF 2.0 and the NIST AI RMF provide the governance and control reference points that practitioners can use to align with this guidance.

Framework | Control / Reference | Relevance
NIST CSF 2.0 | ID.IM-1 | Supports ongoing risk identification for data that changes sensitivity when linked.
NIST CSF 2.0 | PR.DS-1 | Addresses protection of sensitive data based on how it is processed and exposed.
NIST AI RMF | - | Risk mapping and governance apply when datasets can reveal identity through combination.

Review combined-data re-identification risk under ID.IM-1 before assigning access and retention rules.