Abstract: Causal discovery is increasingly applied to large-scale telemetry data to estimate the effects of user-facing interventions, yet its reliability for decision-making in feedback-driven systems with strong self-selection remains unclear. In this paper, we propose an effect-centric, admissibility-first framework that treats discovered graphs as structural hypotheses and evaluates them by identifiability, stability, and falsification rather than by graph recovery accuracy alone. Empirically, we study the effect of early exposure to competitive gameplay on short-term retention using real-world game telemetry. We find that many statistically plausible discovery outputs do not admit point-identified causal queries once minimal temporal and semantic constraints are enforced, highlighting identifiability as a critical bottleneck for decision support. When identification is possible, several algorithm families converge to similar, decision-consistent effect estimates despite producing substantially different graph structures, including cases where the direct treatment-outcome edge is absent and the effect is preserved through indirect causal pathways. These converging estimates survive placebo, subsampling, and sensitivity refutation tests. In contrast, other methods exhibit sporadic admissibility and threshold-sensitive or attenuated effects due to endpoint ambiguity. These results suggest that graph-level metrics alone are inadequate proxies for the causal reliability of a given target query. Trustworthy causal conclusions in telemetry-driven systems therefore require prioritizing admissibility and effect-level validation over structural recovery alone.
Abstract: Failures in complex systems demand rapid Root Cause Analysis (RCA) to prevent cascading damage. Existing RCA methods that operate without a dependency graph typically assume that the root cause has the highest anomaly score. This assumption fails when faults propagate, as a small delay at the root cause can accumulate into a much larger anomaly downstream. In this paper, we propose PRISM, a simple and efficient framework for RCA when the dependency graph is absent. We formulate a class of component-based systems under which PRISM performs RCA with theoretical guarantees. On 735 failures across 9 real-world datasets, PRISM achieves 68% Top-1 accuracy, a 258% improvement over the best baseline, while requiring only 8 ms per diagnosis.
Abstract: We propose HyGLAD, a novel algorithm that automatically builds a set of interpretable patterns that model event data. These patterns can then be used to detect event-based anomalies in a stationary system, where any deviation from past behavior may indicate malicious activity. The algorithm infers equivalence classes of entities that exhibit similar behavior in the observed events, and then builds regular expressions that capture the values of those entities. Unlike the representations learned by deep-learning approaches, the regular expressions are directly interpretable, which in turn makes the detected anomalies interpretable. We evaluate HyGLAD against all seven unsupervised anomaly detection methods from DeepOD on five datasets from real-world systems. The experimental results show that, on average, HyGLAD outperforms existing deep-learning methods while being an order of magnitude more efficient in training and inference (a single CPU versus a GPU). Precision improves by 1.2x and recall by 1.3x relative to the second-best baseline.