TL;DR: Abnormal AI says it keeps signal extraction consistent and reduces drift across scoring and aggregation by running one Signals DAG across 3 production systems: 2 online services at up to 35k QPS and 1 batch job processing 3TB daily. The governance lesson is that explicit data dependencies matter more than isolated model tuning when detection pipelines scale.
NHIMG editorial — based on content published by Abnormal AI: Signals DAG architecture and production scaling in the detection engine
By the numbers:
- Abnormal AI says its Signals DAG runs across 3 production systems, including 2 online services at up to 35k QPS and 1 batch job processing 3TB daily.
Questions worth separating out
Q: How should security teams prevent drift in large-scale detection pipelines?
A: Security teams should define each derived signal once, then reuse that definition across scoring, aggregation, and reporting paths.
Q: Why do separate batch and realtime systems create governance risk?
A: Separate batch and realtime systems create governance risk because the same behaviour can be interpreted through different logic at different times.
Q: How do you know if a feature pipeline is becoming too complex to trust?
A: A feature pipeline is becoming too complex to trust when engineers must remember hidden dependencies, duplicate logic, or system-specific exceptions to explain results.
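The drift described in these answers can be made testable with a parity check that runs the same events through both paths and reports any disagreement. The sketch below is illustrative only: the signal `is_external`, the event fields, and both implementations are hypothetical and are not taken from Abnormal AI's post.

```python
from typing import Callable

Event = dict  # hypothetical event shape: {"sender_domain": str, "org_domain": str}


def batch_is_external(event: Event) -> bool:
    # Hypothetical batch implementation: compares domains case-insensitively.
    return event["sender_domain"].lower() != event["org_domain"].lower()


def realtime_is_external(event: Event) -> bool:
    # Hypothetical realtime implementation written separately.
    # It compares case-sensitively, which is the source of drift here.
    return event["sender_domain"] != event["org_domain"]


def parity_mismatches(
    events: list[Event],
    path_a: Callable[[Event], bool],
    path_b: Callable[[Event], bool],
) -> list[tuple[Event, bool, bool]]:
    """Return (event, a_result, b_result) for every event where the paths disagree."""
    mismatches = []
    for event in events:
        a, b = path_a(event), path_b(event)
        if a != b:
            mismatches.append((event, a, b))
    return mismatches


sample_events = [
    {"sender_domain": "Example.com", "org_domain": "example.com"},
    {"sender_domain": "other.org", "org_domain": "example.com"},
]
drift = parity_mismatches(sample_events, batch_is_external, realtime_is_external)
```

A check like this only detects drift after it exists. It does not remove the duplicated logic that caused it.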
Practitioner guidance
- Inventory every derived signal and its dependencies. Document where each aggregate feature comes from, which inputs it consumes, and which scoring or aggregation systems reuse it.
- Standardise signal definitions across execution modes. Use a single composition layer so batch and realtime paths do not each evolve their own interpretation of the same behavioural feature.
- Check for leaky abstractions before scaling detection pipelines. Look for places where engineers must remember different rules for different systems just to preserve one analytical outcome.
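The inventory step above could start as a small, queryable record of each derived signal, its inputs, and the systems that reuse it. This is a minimal sketch under stated assumptions: `SignalSpec`, the signal names, and the system names are hypothetical and do not reflect Abnormal AI's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SignalSpec:
    name: str
    inputs: tuple[str, ...]          # raw fields or other derived signals consumed
    consumers: tuple[str, ...] = ()  # systems that reuse this signal


# Hypothetical inventory entries, for illustration only.
INVENTORY = {
    spec.name: spec
    for spec in [
        SignalSpec(
            "sender_msg_count_7d",
            ("raw.sender", "raw.timestamp"),
            ("scoring", "aggregation"),
        ),
        SignalSpec("sender_is_rare", ("sender_msg_count_7d",), ("scoring",)),
    ]
}


def raw_inputs(name: str, inventory: dict[str, SignalSpec]) -> set[str]:
    """Return the raw fields a signal depends on, following derived signals transitively."""
    spec = inventory.get(name)
    if spec is None:
        return {name}  # not in the inventory, so treated as a raw input
    found: set[str] = set()
    for dep in spec.inputs:
        found |= raw_inputs(dep, inventory)
    return found


def reused_by(system: str, inventory: dict[str, SignalSpec]) -> list[str]:
    """List the derived signals that a given system consumes."""
    return sorted(s.name for s in inventory.values() if system in s.consumers)
```

With such a record, a question like "which raw fields can change what scoring sees?" becomes a lookup and no longer depends on what engineers remember.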
What's in the full article
Abnormal AI's full blog post covers the operational detail this post intentionally leaves for the source:
- How the Signals DAG executor is implemented in Python and how its primitives shape feature composition
- The realtime Kafka consumer flow, including how upsert instructions move into the Go-based storage service
- The batch processing pattern built on Spark and Airflow for 3TB daily workloads
- The rationale behind Redis, Kafka, and Spark as storage, streaming, and batch components
👉 Read Abnormal AI's Signals DAG architecture analysis for production ML systems →
Signals DAG architecture: what it means for ML pipeline governance
Explore further
Explicit dependency modelling is now a governance control, not just an engineering preference. The Signals DAG approach shows that large detection systems break down when feature lineage is implicit. Once outputs depend on hidden execution order, teams lose the ability to explain changes, compare runs, or trust that scoring and aggregation saw the same inputs. For practitioners, the lesson is that declarative dependency mapping is a prerequisite for auditability in any high-scale security analytics pipeline.
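To illustrate what declarative dependency mapping can look like, the sketch below declares each signal with its dependencies and derives the execution order from that declaration. It uses Python's standard-library `graphlib` and is not Abnormal AI's executor. The signal names, the `SIGNALS` layout, and the 0.2 threshold are hypothetical.

```python
from graphlib import TopologicalSorter

# name -> (dependencies, function taking the dependency values in that order)
# Hypothetical signals, for illustration only.
SIGNALS = {
    "is_link_heavy": (("link_ratio",), lambda link_ratio: link_ratio > 0.2),
    "link_ratio": (
        ("link_count", "word_count"),
        lambda link_count, word_count: link_count / max(word_count, 1),
    ),
}


def run_dag(signals: dict, raw: dict) -> dict:
    """Compute every declared signal in dependency order, starting from raw inputs."""
    graph = {name: set(deps) for name, (deps, _) in signals.items()}
    values = dict(raw)
    # static_order() raises graphlib.CycleError if the declarations contain a cycle.
    for name in TopologicalSorter(graph).static_order():
        if name in values:
            continue  # raw input supplied by the caller
        if name not in signals:
            raise KeyError(f"missing raw input or undeclared signal: {name}")
        deps, fn = signals[name]
        values[name] = fn(*(values[d] for d in deps))
    return values


result = run_dag(SIGNALS, {"link_count": 3, "word_count": 10})
```

Here `is_link_heavy` is declared before the signal it depends on, yet it is still computed after it. The execution order comes from the declared dependencies, not from the order in which the code was written.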
A few things that frame the scale:
- Only 44% of organisations are currently using a dedicated secrets management system, according to The 2024 State of Secrets Management Survey.
- 54% of organisations are dissatisfied with their current secrets management solution because not all secrets are secured, and 43% cite lack of central management.
A question worth separating out:
Q: What is the difference between a shared signal definition and duplicated implementation?
A: A shared signal definition creates one authoritative description of how a feature is derived, while duplicated implementation creates multiple versions that can diverge over time. The first supports consistency and auditability. The second increases maintenance cost and makes scoring outcomes depend on which system processed the data.
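As a rough sketch of the shared-definition side of that distinction, both the realtime and the batch path below call the same function, so neither can develop its own interpretation. The signal `link_ratio` and both wrappers are hypothetical and are not drawn from the source.

```python
def link_ratio(link_count: int, word_count: int) -> float:
    """Single authoritative definition of a hypothetical signal."""
    return link_count / max(word_count, 1)


def score_realtime(event: dict) -> float:
    # The realtime path calls the shared definition for one event.
    return link_ratio(event["link_count"], event["word_count"])


def aggregate_batch(events: list[dict]) -> float:
    """Batch path: mean link ratio over a set of events, using the same definition."""
    if not events:
        return 0.0
    total = sum(link_ratio(e["link_count"], e["word_count"]) for e in events)
    return total / len(events)


batch_events = [
    {"link_count": 2, "word_count": 10},
    {"link_count": 0, "word_count": 0},
]
```

A change to the zero-word edge case now happens in one place and reaches both paths at once. That is the consistency property the answer describes.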
👉 Read our full editorial: Signals DAG architecture shows why ML pipelines need explicit dependencies