By NHI Mgmt Group Editorial Team
Published 2025-06-25
Domain: Governance & Risk
Source: StrongDM

TL;DR: Metrics, logs, and traces together improve system visibility, but each pillar breaks down under microservices sprawl, high-cardinality data, and sampling limits, according to StrongDM. For IAM and NHI teams, the real lesson is that observability data is only useful when identity and access events are centralized enough to explain who touched what, when, and through which path.


At a glance

What this is: StrongDM frames metrics, logs, and traces as the three pillars of observability, showing why each one is useful but incomplete on its own in distributed systems.

Why it matters: The article matters to IAM practitioners because the same fragmentation that complicates service telemetry also obscures non-human access, delegated privilege, and auditability across NHI, autonomous, and human identity programmes.



Context

Observability is the ability to understand what is happening inside a distributed system from the data it produces. In cloud environments, the problem is not just system uptime, but whether teams can connect performance signals with identity and access events quickly enough to diagnose failure, abuse, or drift.

The article’s core governance point is that visibility alone is not control. For identity teams, metrics, logs, and traces become useful only when they can explain access paths, privilege use, and service-to-service behaviour across databases, servers, clusters, and other workloads.


Key questions

Q: How should security teams use observability data to investigate access issues in distributed systems?

A: Security teams should use metrics to detect anomalies, logs to reconstruct the identity trail, and traces to understand request flow across services. The key is correlation. Without shared request IDs, central log aggregation, and workload identity context, observability data becomes a collection of disconnected signals rather than evidence of who accessed what and through which service path.
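
The correlation step can be sketched as a join on a shared request ID. This is a minimal illustration, not StrongDM's method: the record shapes and the field names (request_id, principal, service, start_ms) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical log events and trace spans; all field names are assumptions.
logs = [
    {"request_id": "r-1", "service": "api-gateway", "principal": "svc-billing", "action": "token.validate"},
    {"request_id": "r-1", "service": "orders-db", "principal": "svc-billing", "action": "SELECT orders"},
    {"request_id": "r-2", "service": "api-gateway", "principal": "alice", "action": "token.validate"},
]
spans = [
    {"request_id": "r-1", "service": "api-gateway", "start_ms": 0},
    {"request_id": "r-1", "service": "orders-api", "start_ms": 4},
    {"request_id": "r-1", "service": "orders-db", "start_ms": 9},
    {"request_id": "r-2", "service": "api-gateway", "start_ms": 0},
]


def correlate(logs, spans):
    """Merge log events and trace spans into one access record per request ID."""
    records = defaultdict(lambda: {"principals": set(), "actions": [], "path": []})
    for event in logs:
        rec = records[event["request_id"]]
        rec["principals"].add(event["principal"])            # who acted
        rec["actions"].append((event["service"], event["action"]))  # what they did, and where
    # Order spans by start time so each path reads in request order.
    for span in sorted(spans, key=lambda s: s["start_ms"]):
        records[span["request_id"]]["path"].append(span["service"])
    return dict(records)


records = correlate(logs, spans)
print(records["r-1"]["principals"], records["r-1"]["path"])
```

If any service omits the request ID, its events drop out of the joined record, which is the disconnected-signal problem described above.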

Q: Why do metrics, logs, and traces still fail to give full visibility?

A: They fail when teams treat them as a checklist instead of a governance layer. Metrics can miss context, logs can fragment across services, and traces can be sampled. Full visibility depends on identity context, normalization, and correlation across the stack, otherwise the organisation can see events but cannot reliably explain them.

Q: What do organisations get wrong about observability in microservices?

A: They often assume more data automatically means better insight. In practice, more tags, more logs, and more traces can increase cost and noise while hiding the identity relationships that matter. The better approach is to design the telemetry model around the questions investigators need to answer during an incident or access review.

Q: How can teams tell whether observability is actually working?

A: Observability is working when a team can move from an alert to the exact request path, identity, and service dependency that caused it. If the answer still requires searching multiple tools, manually matching timestamps, or guessing which account acted, the organisation has monitoring data but not usable observability.


Technical breakdown

Metrics in distributed access monitoring

Metrics are time-series measurements such as latency, request rate, error rate, and resource consumption. In distributed systems, they are the fastest way to spot abnormal behaviour, but they do not explain the underlying sequence of events. High-cardinality tags create a scale problem because each new label combination multiplies storage and query cost. That means metrics are best for alerting and trend detection, not for reconstructing identity-driven incidents or proving which workload actually initiated access.

Practical implication: use metrics for early warning, then pair them with identity-aware logs before deciding whether an access issue is operational or security-related.
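
A rough worked example shows why label combinations multiply cost. The label names and counts below are invented for illustration, and the product is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on distinct time series for one metric:
    the product of the number of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets; the numbers are assumptions for the example.
base = {"service": 50, "endpoint": 20, "status_code": 5}
with_identity = {**base, "service_account": 400}

print(series_count(base))           # 5000
print(series_count(with_identity))  # 2000000
```

Adding a single identity label with 400 values turns 5,000 potential series into 2,000,000, which is why per-identity detail usually belongs in logs rather than metric labels.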

Logs and the identity trail

Logs are timestamped records of discrete events, often including who acted, what happened, where it occurred, and how the system responded. They are the strongest pillar for forensic reconstruction because they preserve context that metrics cannot. In microservices, though, logging becomes inconsistent across services, formats diverge, and volume rises fast. Without central aggregation and normalization, the identity trail fragments, and investigators lose the ability to connect a request to the credential, service account, or operator behind it.

Practical implication: centralize and normalize logs so access, authentication, and privileged actions can be correlated across services during incident review.
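
Normalization can be sketched as mapping each service's format onto one shared schema. Both input formats and every field name here are hypothetical; a real pipeline would need a parser per source and handling for malformed lines.

```python
import json
from datetime import datetime, timezone


def normalize(raw: str) -> dict:
    """Map two hypothetical log formats onto one schema with UTC timestamps."""
    raw = raw.strip()
    if raw.startswith("{"):
        # Format A (assumed): JSON with epoch seconds and short field names.
        rec = json.loads(raw)
        when = datetime.fromtimestamp(rec["ts"], tz=timezone.utc)
        return {
            "timestamp": when.isoformat(),
            "request_id": rec["req"],
            "principal": rec["user"],
            "action": rec["event"],
        }
    # Format B (assumed): space-separated key=value pairs with an ISO 8601 time.
    fields = dict(pair.split("=", 1) for pair in raw.split())
    when = datetime.fromisoformat(fields["time"]).astimezone(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "request_id": fields["request_id"],
        "principal": fields["account"],
        "action": fields["op"],
    }


gateway_line = '{"ts": 1750845600, "req": "r-1", "user": "svc-billing", "event": "db.read"}'
database_line = "time=2025-06-25T12:00:00+02:00 request_id=r-1 account=svc-billing op=db.read"

a = normalize(gateway_line)
b = normalize(database_line)
print(a == b)
```

Once both services emit the same keys and a common time zone, the two lines describe the same event and can be matched without manual timestamp arithmetic.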

Traces and request-path visibility

Traces map the journey of a request across services, showing where latency, dependency failures, or bottlenecks appear. They are especially useful in containerized and API-heavy architectures because they expose sequence, not just outcome. But traces are sampled because capturing every request is expensive, which means they can miss rare but important access paths. For identity governance, that matters because an access chain can look healthy in aggregate while still hiding a privilege boundary failure in one service hop.

Practical implication: combine tracing with access telemetry so you can distinguish normal service latency from a privileged request path that crosses an unexpected boundary.
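
The sampling risk can be made concrete with a small calculation. Assuming each request is sampled independently at a fixed rate, the chance that a rare path is never captured is (1 - rate) raised to the number of occurrences. The keep rule that follows is one possible mitigation, not something the source prescribes, and the "privileged" span attribute is hypothetical.

```python
def miss_probability(sample_rate: float, occurrences: int) -> float:
    """Probability that none of `occurrences` requests is sampled,
    assuming independent sampling at a fixed rate."""
    return (1.0 - sample_rate) ** occurrences


def should_keep(span: dict, sample_rate: float, roll: float) -> bool:
    """Always keep spans flagged as privileged (hypothetical attribute);
    sample everything else. `roll` is a uniform random draw in [0, 1)."""
    return bool(span.get("privileged", False)) or roll < sample_rate


# A path seen 10 times under 1% sampling is missed most of the time.
print(round(miss_probability(0.01, 10), 3))  # 0.904

print(should_keep({"service": "orders-db", "privileged": True}, 0.01, roll=0.75))  # True
print(should_keep({"service": "orders-api"}, 0.01, roll=0.75))                     # False
```

Under these assumptions a privileged hop that happens ten times a day would go unrecorded on roughly nine days out of ten, unless the sampler is told to treat it differently.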




NHI Mgmt Group analysis

Observability is now an identity problem, not just a telemetry problem. In distributed environments, the question is no longer whether systems emit data, but whether that data can explain access and privilege movement across services. Metrics, logs, and traces each illuminate part of the path, yet none of them is a governance model. For IAM and NHI programmes, this means observability should be treated as evidence of control coverage, not as a substitute for it. The practical conclusion is that identity context must be built into telemetry from the start.

Log fragmentation creates the same governance failure as NHI sprawl. When every service emits records differently, the result is not richer visibility but broken accountability. That mirrors the broader NHI problem where service accounts, tokens, and API keys proliferate faster than central oversight. The article’s real lesson is that decentralised systems punish organisations that rely on local logging habits instead of a shared access narrative. Practitioners should read this as a warning that auditability collapses when identity data cannot be correlated.

High-cardinality telemetry is a useful named concept for modern access control. The article shows how adding more tags or dimensions can make observability more expensive and harder to query, which is exactly what happens when teams try to layer identity signals onto poorly designed data structures. If access context cannot be queried at speed, it cannot support detection, forensics, or certification. The practical conclusion is that identity teams need data models that keep access evidence searchable before they add more volume.

SLO-driven observability is the right bridge between system health and identity governance. The article correctly points out that raw visibility is not the goal. Service level objectives turn telemetry into decision support by tying data to business expectations, which is also how identity programmes should evaluate privileged access, service availability, and incident impact. The important shift is from collecting more signals to proving whether access behaviour stays within the service outcomes the organisation actually needs. Practitioners should align observability outputs to governance objectives, not dashboard vanity.

The so-called fourth pillar only matters if identity is part of the signal chain. AI-assisted analysis, dashboards, and alerting can reduce noise, but they do not fix missing identity context. If an organisation cannot tell which account, token, or workload initiated a request, better visualisation simply makes the gap easier to see. The field should treat observability tooling as an amplifier of governance quality, not as a repair mechanism for weak access design. Practitioners should validate identity linkage before they scale analytics.

From our research:

  • 97% of NHIs carry excessive privileges, increasing unauthorised access and broadening the attack surface, according to the Ultimate Guide to NHIs.
  • Only 5.7% of organisations have full visibility into their service accounts, which means most teams cannot reliably correlate access events with identity ownership.
  • For a broader view of how visibility, rotation, and offboarding fit together, see the Ultimate Guide to NHIs, Key Challenges and Risks.

What this signals

Telemetry quality will increasingly be judged by identity correlation, not dashboard density. If teams cannot tie a trace or log entry to a workload identity, the observability stack will keep producing noise instead of operational evidence. For identity programmes, the practical shift is to treat access lineage as a first-class part of the data model, not an afterthought added during incident response.

High-cardinality access data is the next governance bottleneck. Once identity, workload, and service tags multiply across clouds, the cost of retaining and querying evidence can outpace the value of collecting it. Practitioners should plan for selective enrichment, consistent schemas, and access-event pivots that remain searchable under pressure.

The organisation that can answer who acted, what service path was used, and whether the privilege was expected will outpace teams still trying to reconcile metrics, logs, and traces after the fact. That is a governance advantage, not just an engineering one.


For practitioners

  • Map identity context into telemetry pipelines: Add service account, workload, and operator identifiers to logs and traces so access events can be correlated across microservices without manual reconstruction.
  • Centralize log aggregation and normalization: Standardize event formats before they enter the SIEM or observability stack, then preserve timestamps, request IDs, and privilege context for investigation.
  • Use metrics for detection, not proof: Treat metrics as anomaly indicators and require logs or traces to confirm the identity path behind any access or performance issue.
  • Control cardinality before you add more labels: Review tag design, limit unbounded dimensions, and keep telemetry queries usable so access evidence remains searchable at scale.
  • Tie observability to SLO and access objectives: Define which service outcomes depend on which privileged paths, then measure whether access behaviour supports those outcomes consistently.
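
As one way to approach the first item, identity context can be attached to every log line at the point of emission. This is a minimal sketch using Python's standard logging and contextvars modules. The context keys (request_id, service_account), the logger name, and the output format are assumptions for illustration.

```python
import contextvars
import io
import logging

# Per-request identity context; the keys used here are assumptions.
identity_ctx = contextvars.ContextVar("identity_ctx", default=None)


class IdentityFilter(logging.Filter):
    """Copy the current identity context onto each log record."""

    def filter(self, record):
        ctx = identity_ctx.get() or {}
        record.request_id = ctx.get("request_id", "-")
        record.service_account = ctx.get("service_account", "-")
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # stands in for a real log sink
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s request_id=%(request_id)s "
    "service_account=%(service_account)s %(message)s"
))
handler.addFilter(IdentityFilter())

logger = logging.getLogger("access-demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Set once at the request boundary; every later log call inherits it.
identity_ctx.set({"request_id": "r-1", "service_account": "svc-billing"})
logger.info("read orders table")

output = stream.getvalue()
print(output, end="")
```

Setting the context once at the request boundary means individual log calls do not need to remember to include identity fields, which reduces the chance of gaps in the trail.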

Key takeaways

  • Metrics, logs, and traces are useful only when they can explain identity and access movement across distributed systems.
  • Fragmented telemetry creates the same accountability problem as fragmented NHI ownership.
  • The practical test for observability is whether a team can reconstruct the request path and identity behind an event without guesswork.

Standards & Framework Alignment

This section maps relevant standards and security frameworks to the operational risks and controls described in this guidance.

The OWASP Non-Human Identity Top 10 addresses the attack and risk surface, while NIST CSF 2.0 and NIST Zero Trust (SP 800-207) set the governance and control requirements practitioners need to meet.

Framework | Control / Reference | Relevance
NIST CSF 2.0 | DE.AE-1 | Observability data supports anomaly detection across distributed access paths.
NIST Zero Trust (SP 800-207) | PA | Zero Trust depends on continuous verification and traceable access decisions.
OWASP Non-Human Identity Top 10 | NHI-06 | Service-account visibility and excessive privilege are central NHI governance gaps.

Inventory non-human identities and reduce standing privilege before extending observability.


Key terms

  • Metrics: Metrics are numeric measurements that describe system health over time, such as latency, error rate, or throughput. In identity and access environments, they are useful for detecting abnormal patterns quickly, but they cannot explain who acted or why without supporting logs and traces.
  • Logs: Logs are timestamped records of discrete system events that preserve context for investigation and audit. For NHI governance, logs become valuable when they are normalized and centrally searchable, because they can connect service actions, credential use, and administrative activity across distributed services.
  • Distributed Traces: Distributed traces record the path of a request as it moves through multiple services. They help teams see sequence, latency, and dependencies across a system, which makes them especially useful when identity-driven access decisions are spread across APIs, microservices, and workload boundaries.
  • High-cardinality Data: High-cardinality data contains many unique combinations of tags or labels, which makes querying and storage more expensive. In observability, this can reduce the usefulness of telemetry. For identity teams, it also makes access evidence harder to search and correlate at incident speed.

Deepen your knowledge

Observability, distributed access, and identity correlation are core topics in our NHI Foundation Level course, the industry's only accredited NHI security programme. If you are trying to connect telemetry with service account governance, it is worth exploring.

This post draws on content published by StrongDM: Three Pillars of Observability Explained: Metrics, Logs, Traces. Read the original.

NHIMG Editorial Note
Published by the NHIMG editorial team on 2025-06-25.
NHI Mgmt Group — the independent authority on Non-Human Identity, IAM, and Agentic AI security. nhimg.org