Topic: Clustering Multivariate Time Series
What is Clustering Multivariate Time Series? Clustering multivariate time series is the process of grouping time series that record more than one timestamped variable according to shared patterns and characteristics.
Papers and Code
Sep 18, 2025
Abstract: Directional forecasting in financial markets requires both accuracy and interpretability. Before the advent of deep learning, interpretable approaches based on human-defined patterns were prevalent, but their structural vagueness and scale ambiguity hindered generalization. In contrast, deep learning models can effectively capture complex dynamics, yet often offer limited transparency. To bridge this gap, we propose a two-stage framework that integrates unsupervised pattern extraction with interpretable forecasting. (i) SIMPC segments and clusters multivariate time series, extracting recurrent patterns that are invariant to amplitude scaling and temporal distortion, even under varying window sizes. (ii) JISC-Net is a shapelet-based classifier that uses the initial part of extracted patterns as input and forecasts subsequent partial sequences for short-term directional movement. Experiments on Bitcoin and three S&P 500 equities demonstrate that our method ranks first or second in 11 out of 12 metric–dataset combinations, consistently outperforming baselines. Unlike conventional deep learning models that output buy-or-sell signals without interpretable justification, our approach enables transparent decision-making by revealing the underlying pattern structures that drive predictive outcomes.
* 10 pages, 7 figures, accepted at ACM CIKM 2025 conference
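As a rough illustration of the two-stage idea (not the paper's SIMPC/JISC-Net; all names and parameters below are illustrative), a minimal sketch in Python: z-normalizing sliding windows before clustering gives amplitude-scale invariance, and matching only the prefix of a new window against cluster prototypes yields a directional signal from each prototype's continuation.

```python
import numpy as np
from sklearn.cluster import KMeans

def znorm(w):
    # Amplitude-invariant representation: zero mean, unit variance per window.
    return (w - w.mean()) / (w.std() + 1e-8)

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=2000))          # toy price series
W = 50                                             # window length
windows = np.array([znorm(prices[i:i + W]) for i in range(len(prices) - W)])

# Offline: cluster normalized windows into recurrent "patterns".
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(windows)
prototypes = km.cluster_centers_

# Online: match only the first part of a new window to each prototype prefix,
# then read the direction from the matched prototype's remaining segment.
prefix_len = 20
new_prefix = znorm(prices[-W:])[:prefix_len]
dists = np.linalg.norm(prototypes[:, :prefix_len] - new_prefix, axis=1)
best = prototypes[np.argmin(dists)]
signal = "up" if best[-1] > best[prefix_len - 1] else "down"
print("predicted short-term direction:", signal)
```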

Sep 18, 2025
Abstract: Stochastic forecasting is critical for efficient decision-making in uncertain systems, such as energy markets and finance, where estimating the full distribution of future scenarios is essential. We propose Diffusion Scenario Tree (DST), a general framework for constructing scenario trees for multivariate prediction tasks using diffusion-based probabilistic forecasting models. DST recursively samples future trajectories and organizes them into a tree via clustering, ensuring non-anticipativity (decisions depending only on observed history) at each stage. We evaluate the framework on the optimization task of energy arbitrage in New York State's day-ahead electricity market. Experimental results show that our approach consistently outperforms the same optimization algorithms that use scenario trees from more conventional models and Model-Free Reinforcement Learning baselines. Furthermore, using DST for stochastic optimization yields more efficient decision policies, achieving higher performance by better handling uncertainty than deterministic and stochastic MPC variants using the same diffusion-based forecaster.
* 5 pages, 2 figures, 2 tables, and 1 algorithm. This version is submitted to the 51st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026), to be held in Barcelona, Spain, on May 4-8, 2026
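A minimal sketch of the scenario-tree construction under stated assumptions: a stub random-walk sampler stands in for the diffusion forecaster, and at each stage the sampled one-step trajectories are clustered into a fixed branching factor, with each branch's probability estimated from its cluster size. Because every child is sampled conditional only on the history through its parent node, non-anticipativity holds by construction.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_paths(history, n_samples, horizon, rng):
    # Stub forecaster standing in for a diffusion model: conditional random walk.
    steps = rng.normal(scale=1.0, size=(n_samples, horizon))
    return history[-1] + np.cumsum(steps, axis=1)

def build_tree(history, stages, branch, n_samples, rng):
    """Recursively build a scenario tree; each node = value, probability, children."""
    if stages == 0:
        return []
    paths = sample_paths(history, n_samples, horizon=1, rng=rng)
    km = KMeans(n_clusters=branch, n_init=5, random_state=0).fit(paths)
    nodes = []
    for k in range(branch):
        members = paths[km.labels_ == k]
        value = float(members.mean())           # representative next-stage value
        prob = len(members) / n_samples         # empirical branch probability
        children = build_tree(np.append(history, value), stages - 1,
                              branch, n_samples, rng)
        nodes.append({"value": value, "prob": prob, "children": children})
    return nodes

rng = np.random.default_rng(0)
tree = build_tree(history=np.array([100.0]), stages=3, branch=3,
                  n_samples=200, rng=rng)
print(tree[0]["value"], tree[0]["prob"], len(tree[0]["children"]))
```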

Sep 16, 2025
Abstract: TimeCluster is a visual analytics technique for discovering structure in long multivariate time series by projecting overlapping windows of data into a low-dimensional space. We show that, when Principal Component Analysis (PCA) is chosen as the dimensionality reduction technique, this procedure is mathematically equivalent to classical linear subspace identification (block-Hankel matrix plus Singular Value Decomposition (SVD)). In both approaches, the same low-dimensional linear subspace is extracted from the time series data. We first review the TimeCluster method and the theory of subspace system identification. Then we show that forming the sliding-window matrix of a time series yields a Hankel matrix, so applying PCA (via SVD) to this matrix recovers the same principal directions as subspace identification. Thus the cluster coordinates from TimeCluster coincide with those produced by subspace identification methods. We present experiments on synthetic and real dynamical signals confirming that the two embeddings coincide. Finally, we explore and discuss future opportunities enabled by this equivalence, including forecasting from the identified state space, streaming/online extensions, incorporating and visualising external inputs, and robust techniques for displaying underlying trends in corrupted data.
* 15 pages, 9 figures
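The claimed equivalence is easy to check numerically: stacking overlapping windows of a series yields a Hankel (trajectory) matrix, and PCA on those windows recovers the same principal directions as an SVD of the centered Hankel matrix, up to sign. A small self-contained demonstration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = np.arange(500)
x = np.sin(0.07 * t) + 0.5 * np.sin(0.23 * t) + 0.05 * rng.normal(size=t.size)

# Sliding-window matrix: row i is the window x[i : i+W].
# Constant anti-diagonals make this a Hankel matrix.
W = 40
H = np.array([x[i:i + W] for i in range(len(x) - W + 1)])

# Route 1: PCA on the window matrix (TimeCluster's embedding step).
pca = PCA(n_components=4).fit(H)

# Route 2: SVD of the Hankel matrix (mean-centered, to match PCA's centering).
U, s, Vt = np.linalg.svd(H - H.mean(axis=0), full_matrices=False)

# The principal directions agree up to sign.
for k in range(4):
    dev = min(np.abs(pca.components_[k] - Vt[k]).max(),
              np.abs(pca.components_[k] + Vt[k]).max())
    print(f"component {k}: max deviation {dev:.2e}")
```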

Jul 22, 2025
Abstract: Clustering of multivariate time series using correlation-based methods reveals regime changes in relationships between variables across health, finance, and industrial applications. However, validating whether discovered clusters represent distinct relationships rather than arbitrary groupings remains a fundamental challenge. Existing clustering validity indices were developed for Euclidean data, and their effectiveness for correlation patterns has not been systematically evaluated. Unlike Euclidean clustering, where geometric shapes provide discrete reference targets, correlations exist in continuous space without equivalent reference patterns. We address this validation gap by introducing canonical correlation patterns as mathematically defined validation targets that discretise the infinite correlation space into finite, interpretable reference patterns. Using synthetic datasets with perfect ground truth across controlled conditions, we demonstrate that canonical patterns provide reliable validation targets, with the L1 norm performing best for pattern mapping and the L5 norm performing best for the silhouette width criterion and the Davies-Bouldin index. These methods are robust to distribution shifts and appropriately detect correlation structure degradation, enabling practical implementation guidelines. This work establishes a methodological foundation for rigorous correlation-based clustering validation in high-stakes domains.
* 45 pages, 8 figures. Introduces canonical correlation patterns as discrete validation targets for correlation-based clustering, systematically evaluates distance functions and validity indices, and provides practical implementation guidelines through controlled experiments with synthetic ground truth data
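A minimal sketch of the validation idea, with hand-picked reference matrices standing in for the paper's canonical patterns: the empirical correlation matrix of a segment is mapped to the nearest reference pattern under the L1 norm, the mapping distance the abstract reports as performing best.

```python
import numpy as np

def upper(c):
    # Vectorize the off-diagonal upper triangle of a correlation matrix.
    return c[np.triu_indices_from(c, k=1)]

# Illustrative reference patterns for 3 variables: no correlation,
# all strongly positive, and a single correlated pair.
refs = {
    "independent": np.eye(3),
    "all_positive": np.array([[1, .9, .9], [.9, 1, .9], [.9, .9, 1.0]]),
    "single_pair": np.array([[1, .9, 0], [.9, 1, 0], [0, 0, 1.0]]),
}

rng = np.random.default_rng(0)
# A segment whose true structure is "single_pair".
z = rng.normal(size=500)
seg = np.column_stack([z + 0.3 * rng.normal(size=500),
                       z + 0.3 * rng.normal(size=500),
                       rng.normal(size=500)])
emp = np.corrcoef(seg, rowvar=False)

# Map the empirical correlation to the nearest reference pattern via L1 norm.
scores = {name: np.abs(upper(emp) - upper(r)).sum() for name, r in refs.items()}
print(min(scores, key=scores.get), scores)
```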

May 12, 2025
Abstract: Clustering multivariate time series data is a crucial task in many domains, as it enables the identification of meaningful patterns and groups in time-evolving data. Traditional approaches, such as crisp clustering, rely on the assumption that clusters are sufficiently separated with little overlap. However, real-world data often defy this assumption, exhibiting overlapping distributions and blurred boundaries between clusters. Fuzzy clustering offers a compelling alternative by allowing partial membership in multiple clusters, making it well suited to these ambiguous scenarios. Despite its advantages, current fuzzy clustering methods primarily focus on univariate time series, and in multivariate settings even datasets of moderate dimensionality become computationally prohibitive. This challenge is further exacerbated when dealing with time series of varying lengths, leaving a clear gap in addressing the complexities of modern datasets. This work introduces a novel fuzzy clustering approach based on common principal component analysis to address these shortcomings. Our method efficiently handles high-dimensional multivariate time series by reducing dimensionality while preserving critical temporal features. Extensive numerical results show that the proposed clustering method outperforms several existing approaches in the literature. An application involving brain signals from different drivers, recorded during a simulated driving experiment, illustrates the potential of the approach.
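A minimal sketch of the two ingredients, with plain PCA on covariance features standing in for the paper's common principal component machinery: each multivariate series is reduced to a low-dimensional feature vector, and fuzzy c-means then assigns partial memberships.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means: returns (centers, membership matrix U)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))        # rows sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / (d ** (2 / (m - 1)))                # inverse-distance weights
        U /= U.sum(axis=1, keepdims=True)             # renormalize memberships
    return centers, U

rng = np.random.default_rng(0)
# 40 toy multivariate series (length 200, 5 channels) from two regimes.
series = [rng.normal(size=(200, 5)) @ np.diag([3, 1, 1, 1, 1]) if i < 20
          else rng.normal(size=(200, 5)) @ np.diag([1, 1, 1, 1, 3])
          for i in range(40)]

# Reduce each series to a feature vector (upper triangle of its covariance
# matrix), then apply PCA as a stand-in for the paper's common PCA.
iu = np.triu_indices(5)
feats = np.array([np.cov(s, rowvar=False)[iu] for s in series])
X = PCA(n_components=3).fit_transform(feats)

centers, U = fuzzy_cmeans(X, c=2)
print("membership of series 0:", np.round(U[0], 2))
```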

May 09, 2025
Abstract: Accurate and efficient multivariate time series (MTS) forecasting is essential for applications such as traffic management and weather prediction, which depend on capturing long-range temporal dependencies and interactions between entities. Existing methods, particularly those based on Transformer architectures, compute pairwise dependencies across all time steps, leading to a computational complexity that scales quadratically with the length of the input. To overcome these challenges, we introduce the Forecaster with Offline Clustering Using Segments (FOCUS), a novel approach to MTS forecasting that simplifies long-range dependency modeling through the use of prototypes extracted via offline clustering. These prototypes encapsulate high-level events in the real-world system underlying the data, summarizing the key characteristics of similar time segments. In the online phase, FOCUS dynamically adapts these patterns to the current input and captures dependencies between the input segment and high-level events, enabling both accurate and efficient forecasting. By identifying prototypes during the offline clustering phase, FOCUS reduces the computational complexity of modeling long-range dependencies in the online phase to linear scaling. Extensive experiments across diverse benchmarks demonstrate that FOCUS achieves state-of-the-art accuracy while significantly reducing computational costs.
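A rough sketch of the offline/online split (illustrative, not the FOCUS architecture): segments are clustered offline into a handful of prototypes whose continuations are stored; online, a new input interacts only with the K prototypes, which costs O(K*L) per query rather than growing quadratically with the input length.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = np.sin(0.05 * np.arange(5000)) + 0.1 * rng.normal(size=5000)

L, H = 64, 16   # input segment length, forecast horizon
segs = np.array([x[i:i + L + H] for i in range(0, len(x) - L - H, 8)])

# Offline: cluster full segments into a few prototypes ("high-level events");
# each prototype stores an input part and its continuation.
K = 12
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(segs)
proto_in, proto_out = km.cluster_centers_[:, :L], km.cluster_centers_[:, L:]

# Online: the new input interacts only with K prototypes, not with all past
# time steps; the forecast is a similarity-weighted blend of continuations.
new_in = x[-L:]
w = np.exp(-np.linalg.norm(proto_in - new_in, axis=1) ** 2 / (2 * L))
w /= w.sum()
forecast = w @ proto_out
print(forecast[:5])
```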

May 20, 2025
Abstract: Time series clustering promises to uncover hidden structural patterns in data with applications across healthcare, finance, industrial systems, and other critical domains. However, without validated ground truth information, researchers cannot objectively assess clustering quality or determine whether poor results stem from absent structures in the data, algorithmic limitations, or inappropriate validation methods, raising the question whether clustering is "more art than science" (Guyon et al., 2009). To address these challenges, we introduce CSTS (Correlation Structures in Time Series), a synthetic benchmark for evaluating the discovery of correlation structures in multivariate time series data. CSTS provides a clean benchmark that enables researchers to isolate and identify specific causes of clustering failures by differentiating between correlation structure deterioration and limitations of clustering algorithms and validation methods. Our contributions are: (1) a comprehensive benchmark for correlation structure discovery with distinct correlation structures, systematically varied data conditions, established performance thresholds, and recommended evaluation protocols; (2) empirical validation of correlation structure preservation showing moderate distortion from downsampling and minimal effects from distribution shifts and sparsification; and (3) an extensible data generation framework enabling structure-first clustering evaluation. A case study demonstrates CSTS's practical utility by identifying an algorithm's previously undocumented sensitivity to non-normal distributions, illustrating how the benchmark enables precise diagnosis of methodological limitations. CSTS advances rigorous evaluation standards for correlation-based time series clustering.
* 9 pages main + 32 pages total, 2 figures main + 6 figures appendix, 1 table main + 17 tables appendix, dataset available at https://huggingface.co/datasets/idegen/csts, code available at https://github.com/isabelladegen/corrclust-validation
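The structure-first idea is easy to illustrate: fix a target correlation matrix as ground truth, generate a segment with that structure by Cholesky coloring, and check that the empirical correlation recovers it. A minimal sketch (not the CSTS generator itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target correlation structure (the "ground truth" a benchmark would fix).
target = np.array([[1.0, 0.8, 0.0],
                   [0.8, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

# Generate a segment with this structure via Cholesky coloring of white noise.
L = np.linalg.cholesky(target)
segment = rng.normal(size=(2000, 3)) @ L.T

# Verify that the empirical correlation recovers the target structure.
emp = np.corrcoef(segment, rowvar=False)
print(np.round(emp, 2))
print("max abs deviation:", np.abs(emp - target).max().round(3))
```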

Apr 24, 2025
Abstract: Background and Objectives: Multidrug Resistance (MDR) is a critical global health issue, causing increased hospital stays, healthcare costs, and mortality. This study proposes an interpretable Machine Learning (ML) framework for MDR prediction, aiming for both accurate inference and enhanced explainability. Methods: Patients are modeled as Multivariate Time Series (MTS), capturing clinical progression and patient-to-patient interactions. Similarity among patients is quantified using MTS-based methods: descriptive statistics, Dynamic Time Warping, and Time Cluster Kernel. These similarity measures serve as inputs for MDR classification via Logistic Regression, Random Forest, and Support Vector Machines, with dimensionality reduction and kernel transformations improving model performance. For explainability, patient similarity networks are constructed from these metrics. Spectral clustering and t-SNE are applied to identify MDR-related subgroups and visualize high-risk clusters, enabling insight into clinically relevant patterns. Results: The framework was validated on ICU Electronic Health Records from the University Hospital of Fuenlabrada, achieving an AUC of 81%. It outperforms baseline ML and deep learning models by leveraging graph-based patient similarity. The approach identifies key risk factors -- prolonged antibiotic use, invasive procedures, co-infections, and extended ICU stays -- and reveals clinically meaningful clusters. Code and results are available at https://github.com/oscarescuderoarnanz/DM4MTS. Conclusions: Patient similarity representations combined with graph-based analysis provide accurate MDR prediction and interpretable insights. This method supports early detection, risk factor identification, and patient stratification, highlighting the potential of explainable ML in critical care.
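A minimal sketch of one branch of the pipeline, under stated assumptions (toy data, a textbook DTW implementation): pairwise DTW distances between patient trajectories are turned into an affinity graph and fed to spectral clustering.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def dtw(a, b):
    """Textbook Dynamic Time Warping distance between multivariate series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
# Toy "patients": short multivariate trajectories from two regimes.
patients = [rng.normal(size=(30, 4)) + (0 if i < 10 else 2.0)
            for i in range(20)]

# Pairwise DTW distances -> similarity graph -> spectral clustering.
n = len(patients)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw(patients[i], patients[j])

affinity = np.exp(-dist / dist.mean())
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels)
```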

Dec 13, 2024
Abstract: Gravitational-wave observatories like LIGO are large-scale terrestrial instruments whose infrastructure spans a multi-kilometer geographic area and must be actively controlled to maintain operational stability over long observation periods. Despite exquisite seismic isolation, they remain susceptible to seismic noise and other terrestrial disturbances that can couple undesirable vibrations into the instrumental infrastructure, potentially leading to control instabilities or noise artifacts in the detector output. It is, therefore, critical to characterize the seismic state of these observatories to identify a set of temporal patterns that can inform the detector operators in day-to-day monitoring and diagnostics. In practice, the operators monitor several seismically relevant data streams to diagnose operational instabilities and sources of noise using simple, empirically determined thresholds. It can be untenable for a human operator to track multiple data streams in this manual fashion, so a distillation of these data streams into a more human-friendly format is sought. In this paper, we present an end-to-end machine learning pipeline for feature-based multivariate time series clustering to achieve this goal and to provide actionable insights to the detector operators by correlating the discovered clusters with events of interest in the detector.
* 8 pages, 6 figures, Accepted to The 5th International Workshop on Big Data & AI Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD 2024)
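A minimal sketch of a feature-based clustering pipeline of this kind (toy data; the features and channel layout are illustrative, not LIGO's): each window of each data stream is summarized by a few statistics, and clustering the per-window feature vectors yields one human-friendly state label per window.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy stand-in for seismic channels: 3 data streams, windowed into chunks.
fs, win = 100, 600                       # sample rate (Hz), window size
streams = rng.normal(size=(3, 60 * win))
streams[1] += 5 * np.sin(2 * np.pi * 0.3 * np.arange(streams.shape[1]) / fs)

def window_features(w):
    # Simple per-window summary: RMS plus low/high band power from the FFT.
    spec = np.abs(np.fft.rfft(w)) ** 2
    return [np.sqrt((w ** 2).mean()), spec[:10].sum(), spec[10:].sum()]

# One feature vector per time window, concatenated across channels.
n_win = streams.shape[1] // win
feats = np.array([
    np.concatenate([window_features(ch[k * win:(k + 1) * win])
                    for ch in streams])
    for k in range(n_win)
])

scaled = StandardScaler().fit_transform(feats)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)
print(labels[:20])   # a human-friendly "seismic state" label per window
```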

Jan 13, 2025
Abstract: Detecting and classifying abnormal system states is critical for condition monitoring, but supervised methods often fall short due to the rarity of anomalies and the lack of labeled data. Therefore, clustering is often used to group similar abnormal behavior. However, evaluating cluster quality without ground truth is challenging, as existing measures such as the Silhouette Score (SSC) only evaluate the cohesion and separation of clusters and ignore possible prior knowledge about the data. To address this challenge, we introduce the Synchronized Anomaly Agreement Index (SAAI), which exploits the synchronicity of anomalies across multivariate time series to assess cluster quality. We demonstrate the effectiveness of SAAI by showing that maximizing SAAI improves accuracy on the task of finding the true number of anomaly classes K in correlated time series by 0.23 compared to SSC and by 0.32 compared to X-Means. We also show that clusters obtained by maximizing SAAI are easier to interpret compared to SSC.
* Accepted at AAAI Workshop on AI for Time Series Analysis (AI4TS) 2025
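The abstract does not give SAAI's exact formula, so the following is a hypothetical synchrony-based score in its spirit: within each cluster, anomalies on member channels should co-occur in time, measured here by pairwise Jaccard overlap of anomaly timestamps. A clustering that matches the synchronized groups scores higher than one that mixes them.

```python
import numpy as np

def synchrony_score(anomaly_masks, labels):
    """Hypothetical SAAI-style score (not the paper's exact formula): average
    pairwise Jaccard overlap of anomaly timestamps within each cluster."""
    scores = []
    for k in np.unique(labels):
        members = anomaly_masks[labels == k]
        if len(members) < 2:
            continue
        pair = []
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                inter = np.logical_and(members[i], members[j]).sum()
                union = np.logical_or(members[i], members[j]).sum()
                pair.append(inter / union if union else 0.0)
        scores.append(np.mean(pair))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
# 6 channels; anomalies are synchronized within two groups of 3.
masks = np.zeros((6, 1000), dtype=bool)
masks[:3, 100:120] = True                      # group A anomalies co-occur
masks[3:, 700:730] = True                      # group B anomalies co-occur
good = np.array([0, 0, 0, 1, 1, 1])            # clustering matching the groups
bad = np.array([0, 1, 0, 1, 0, 1])             # clustering mixing the groups
print(synchrony_score(masks, good), synchrony_score(masks, bad))
```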
