Crowd counting is the process of estimating the number of people in a crowd from images or videos.
Crowd counting plays a vital role in public safety, traffic regulation, and smart city management. However, despite the impressive progress achieved by CNN- and Transformer-based models, their performance often deteriorates when applied across diverse environments due to severe domain discrepancies. Direct joint training on multiple datasets, which intuitively should enhance generalization, instead results in negative transfer, as shared and domain-specific representations become entangled. To address this challenge, we propose the Feature Separation and Cross-Attention Network (FSCA-Net), a unified framework that explicitly disentangles feature representations into domain-invariant and domain-specific components. A novel cross-attention fusion module adaptively models interactions between these components, ensuring effective knowledge transfer while preserving dataset-specific discriminability. Furthermore, a mutual information optimization objective is introduced to maximize consistency among domain-invariant features and minimize redundancy among domain-specific ones, promoting complementary shared-private representations. Extensive experiments on multiple crowd counting benchmarks demonstrate that FSCA-Net effectively mitigates negative transfer and achieves state-of-the-art cross-dataset generalization, providing a robust and scalable solution for real-world crowd analysis.
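The abstract does not give implementation details for the cross-attention fusion module; the following is a minimal, generic PyTorch sketch of how a domain-invariant branch and a domain-specific branch could attend to each other and be fused. The class name, token shape, and layer sizes are illustrative assumptions, not FSCA-Net's actual configuration.

```python
# Generic sketch of cross-attention fusion between two feature branches.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Domain-invariant tokens attend to domain-specific tokens, and vice versa.
        self.inv_to_spec = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spec_to_inv = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, f_inv, f_spec):
        # f_inv, f_spec: (B, N, dim) token sequences from the two branches.
        a, _ = self.inv_to_spec(query=f_inv, key=f_spec, value=f_spec)
        b, _ = self.spec_to_inv(query=f_spec, key=f_inv, value=f_inv)
        return self.proj(torch.cat([f_inv + a, f_spec + b], dim=-1))

fusion = CrossAttentionFusion(dim=256)
out = fusion(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 196, 256])
```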
Crowd counting remains challenging in variable-density scenes due to scale variations, occlusions, and the high computational cost of existing models. To address these issues, we propose RepSFNet (Reparameterized Single Fusion Network), a lightweight architecture designed for accurate and real-time crowd estimation. RepSFNet leverages a RepLK-ViT backbone with large reparameterized kernels for efficient multi-scale feature extraction. It further integrates a Feature Fusion module combining Atrous Spatial Pyramid Pooling (ASPP) and Context-Aware Network (CAN) to achieve robust, density-adaptive context modeling. A Concatenate Fusion module is employed to preserve spatial resolution and generate high-quality density maps. By avoiding attention mechanisms and multi-branch designs, RepSFNet significantly reduces parameters and computational complexity. The training objective combines Mean Squared Error and Optimal Transport loss to improve both count accuracy and spatial distribution alignment. Experiments conducted on ShanghaiTech, NWPU, and UCF-QNRF datasets demonstrate that RepSFNet achieves competitive accuracy while reducing inference latency by up to 34 percent compared to recent state-of-the-art methods, making it suitable for real-time and low-power edge computing applications.
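As a rough illustration of the reparameterized large-kernel idea the abstract alludes to (not RepSFNet's actual branch layout), the sketch below merges a large-kernel convolution and a parallel small-kernel convolution into a single equivalent convolution for inference; all kernel sizes and channel counts are illustrative assumptions.

```python
# Structural reparameterization: two conv branches merged into one large-kernel conv.
import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_branches(conv_large: nn.Conv2d, conv_small: nn.Conv2d) -> nn.Conv2d:
    k_l, k_s = conv_large.kernel_size[0], conv_small.kernel_size[0]
    pad = (k_l - k_s) // 2
    merged = nn.Conv2d(conv_large.in_channels, conv_large.out_channels,
                       kernel_size=k_l, padding=k_l // 2, bias=True)
    # Zero-pad the small kernel to the large kernel size, then sum weights and biases.
    merged.weight.data = conv_large.weight.data + F.pad(conv_small.weight.data,
                                                        [pad, pad, pad, pad])
    merged.bias.data = conv_large.bias.data + conv_small.bias.data
    return merged

x = torch.randn(1, 16, 32, 32)
large, small = nn.Conv2d(16, 16, 13, padding=6), nn.Conv2d(16, 16, 3, padding=1)
y_train = large(x) + small(x)               # two-branch output during training
y_infer = merge_branches(large, small)(x)   # single-branch output at inference
print(torch.allclose(y_train, y_infer, atol=1e-4))  # True (up to float error)
```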
Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.
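The abstract leaves the connector architecture unspecified; as a hedged sketch of the routing idea, the code below keeps one small projection per language family rather than one per language. The family map, connector shape, and dimensions are illustrative assumptions, not the paper's actual setup.

```python
# One lightweight connector per language family, shared by all its languages.
import torch
import torch.nn as nn

LANG_TO_FAMILY = {  # hypothetical mapping used only for this example
    "es": "romance", "fr": "romance", "it": "romance",
    "de": "germanic", "nl": "germanic", "sv": "germanic",
}

class FamilyConnectors(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096):
        super().__init__()
        families = sorted(set(LANG_TO_FAMILY.values()))
        # A small 2-layer MLP per family maps frozen speech features to the LLM space.
        self.connectors = nn.ModuleDict({
            f: nn.Sequential(nn.Linear(enc_dim, llm_dim), nn.GELU(),
                             nn.Linear(llm_dim, llm_dim))
            for f in families
        })

    def forward(self, speech_feats, lang):
        # speech_feats: (B, T, enc_dim) outputs of the frozen speech encoder.
        return self.connectors[LANG_TO_FAMILY[lang]](speech_feats)

connector = FamilyConnectors()
llm_inputs = connector(torch.randn(2, 50, 1024), lang="fr")
print(llm_inputs.shape)  # torch.Size([2, 50, 4096])
```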
Counting and tracking dense crowds in large-scale scenes is highly challenging, yet existing methods mainly rely on datasets captured by fixed cameras, which provide limited spatial coverage and are inadequate for large-scale dense crowd analysis. To address this limitation, we propose a flexible solution using moving drones to capture videos and perform video-level crowd counting and tracking of unique pedestrians across entire scenes. We introduce MovingDroneCrowd++, the largest video-level dataset for dense crowd counting and tracking captured by moving drones, covering diverse and complex conditions with varying flight altitudes, camera angles, and illumination. Existing methods fail to achieve satisfactory performance on this dataset. To address this, we propose GD3A (Global Density Map Decomposition via Descriptor Association), a density map-based video individual counting method that avoids explicit localization. GD3A establishes pixel-level correspondences between pedestrian descriptors across consecutive frames via optimal transport with an adaptive dustbin score, enabling the decomposition of global density maps into shared, inflow, and outflow components. Building on this framework, we further introduce DVTrack, which converts descriptor-level matching into instance-level associations through a descriptor voting mechanism for pedestrian tracking. Experimental results show that our methods significantly outperform existing approaches under dense crowds and complex motion, reducing counting error by 47.4 percent and improving tracking performance by 39.2 percent.
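The abstract names optimal transport with an adaptive dustbin but not its exact formulation; the sketch below shows a generic log-domain Sinkhorn matcher augmented with a dustbin row and column, where mass sent to the dustbins plays the role of outflow and inflow. The dustbin score, temperature, and iteration count are illustrative assumptions, not GD3A's actual algorithm.

```python
# Descriptor matching via Sinkhorn optimal transport with a dustbin for unmatched mass.
import math
import torch
import torch.nn.functional as F

def log_sinkhorn(Z, log_mu, log_nu, iters=50):
    # Sinkhorn iterations in log space; Z is the (M+1, N+1) augmented score matrix.
    u, v = torch.zeros_like(log_mu), torch.zeros_like(log_nu)
    for _ in range(iters):
        u = log_mu - torch.logsumexp(Z + v.unsqueeze(0), dim=1)
        v = log_nu - torch.logsumexp(Z + u.unsqueeze(1), dim=0)
    return Z + u.unsqueeze(1) + v.unsqueeze(0)

def match_with_dustbin(sim, dustbin_score=0.0, iters=50):
    # sim: (M, N) similarities between frame-t and frame-(t+1) descriptors.
    M, N = sim.shape
    aug = F.pad(sim, [0, 1, 0, 1], value=dustbin_score)   # add dustbin row/column
    # Real descriptors carry unit mass; each dustbin absorbs the unmatched remainder.
    log_mu = torch.cat([torch.zeros(M), torch.tensor([math.log(N)])]) - math.log(M + N)
    log_nu = torch.cat([torch.zeros(N), torch.tensor([math.log(M)])]) - math.log(M + N)
    return log_sinkhorn(aug, log_mu, log_nu, iters).exp()  # (M+1, N+1) coupling

desc_t  = F.normalize(torch.randn(5, 64), dim=1)   # descriptors in frame t
desc_t1 = F.normalize(torch.randn(7, 64), dim=1)   # descriptors in frame t+1
P = match_with_dustbin(desc_t @ desc_t1.T / 0.1)
# P[:5, :7] -> shared mass; P[:5, 7] -> outflow (left the view); P[5, :7] -> inflow (new).
print(P.shape, P[:5, :7].sum().item())
```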
Device-free crowd counting using WiFi Channel State Information (CSI) is a key enabling technology for a new generation of privacy-preserving Internet of Things (IoT) applications. However, practical deployment is severely hampered by the domain shift problem, where models trained in one environment fail to generalise to another. To overcome this, we propose a novel two-stage framework centred on a CSI-ResNet-A architecture. This model is pre-trained via self-supervised contrastive learning to learn domain-invariant representations and leverages lightweight Adapter modules for highly efficient fine-tuning. The resulting event sequence is then processed by a stateful counting machine to produce a final, stable occupancy estimate. We validate our framework extensively. On our WiFlow dataset, our unsupervised approach excels in a 10-shot learning scenario, achieving a final Mean Absolute Error (MAE) of just 0.44, a task where supervised baselines fail. To formally quantify robustness, we introduce the Generalisation Index (GI), on which our model scores near-perfectly, confirming its ability to generalise. Furthermore, our framework sets a new state of the art on the public WiAR benchmark with 98.8% accuracy. Our ablation studies reveal the core strength of our design: adapter-based fine-tuning achieves performance within 1% of a full fine-tune (98.84% vs. 99.67%) while training 97.2% fewer parameters. Our work provides a practical and scalable solution for developing robust sensing systems ready for real-world IoT deployments.
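The adapter design itself is not detailed in the abstract; the following is a minimal sketch of the general bottleneck-adapter pattern (a small down-/up-projection added to a frozen backbone so that only a few parameters are trained), with layer sizes that are illustrative assumptions rather than CSI-ResNet-A's actual configuration.

```python
# Bottleneck adapter on a frozen backbone: only the adapter and head are trained.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity (residual) mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

backbone = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
for p in backbone.parameters():
    p.requires_grad = False              # pre-trained backbone stays frozen

adapter, head = Adapter(dim=128), nn.Linear(128, 1)   # only these are updated
x = torch.randn(4, 128)                  # stand-in for encoded CSI features
count = head(adapter(backbone(x)))
trainable = sum(p.numel() for p in adapter.parameters()) + sum(p.numel() for p in head.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```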
Video Individual Counting (VIC) is a recently introduced task aiming to estimate pedestrian flux from a video. It extends Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC, which learns to count pedestrians across frames, VIC must identify co-existent pedestrians between frames, which is essentially a correspondence problem. Existing VIC approaches, however, can underperform in congested scenes such as metro commuting. To address this, we build WuhanMetroCrowd, one of the first VIC datasets that characterize crowded, dynamic pedestrian flows. It features sparse-to-dense density levels, short-to-long video clips, slow-to-fast flow variations, front-to-back appearance changes, and light-to-heavy occlusions. To better adapt VIC approaches to crowds, we rethink the nature of VIC and recognize two informative priors: i) the social grouping prior, which indicates that pedestrians tend to gather in groups, and ii) the spatial-temporal displacement prior, which tells us that an individual cannot physically teleport. The former inspires us to relax the standard one-to-one (O2O) matching used by VIC to one-to-many (O2M) matching, implemented by an implicit context generator and an O2M matcher; the latter facilitates the design of a displacement prior injector, which strengthens not only O2M matching but also feature extraction and model training. These designs jointly form OMAN++, a novel and strong VIC baseline. Extensive experiments show that OMAN++ not only outperforms state-of-the-art VIC baselines on the standard SenseCrowd, CroHD, and MovingDroneCrowd benchmarks, but also shows a clear advantage in crowded scenes, with a 38.12% error reduction on our WuhanMetroCrowd dataset. Code, data, and pretrained models are available at https://github.com/tiny-smart/OMAN.
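The abstract describes the two priors only at a high level; as a hedged illustration of how they might enter a matching step, the sketch below masks descriptor similarities with a maximum-displacement constraint and relaxes one-to-one matching to a soft one-to-many assignment. The threshold and the softmax relaxation are assumptions made for this example, not OMAN++'s actual matcher.

```python
# Displacement-masked, one-to-many soft assignment between two frames.
import torch
import torch.nn.functional as F

def displacement_mask(pos_t, pos_t1, max_disp):
    # pos_*: (M, 2) and (N, 2) pixel positions; True where a match is physically plausible.
    return torch.cdist(pos_t, pos_t1) <= max_disp

def o2m_assignment(feat_t, feat_t1, mask, temperature=0.1):
    sim = feat_t @ feat_t1.T / temperature        # (M, N) descriptor similarity
    sim = sim.masked_fill(~mask, -1e9)            # rule out teleport-like candidates
    return torch.softmax(sim, dim=1)              # each row may spread mass over several matches

pos_t, pos_t1 = torch.rand(6, 2) * 200, torch.rand(8, 2) * 200
feat_t  = F.normalize(torch.randn(6, 32), dim=1)
feat_t1 = F.normalize(torch.randn(8, 32), dim=1)
A = o2m_assignment(feat_t, feat_t1, displacement_mask(pos_t, pos_t1, max_disp=120.0))
print(A.shape)  # torch.Size([6, 8])
```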
Crowd counting typically relies on labor-intensive point-level annotations and computationally intensive backbones, restricting its scalability and deployment in resource-constrained environments. To address these challenges, this paper proposes TCFormer, a tiny, ultra-lightweight, weakly supervised transformer-based crowd counting framework with only 5 million parameters that achieves competitive performance. First, a powerful yet efficient vision transformer is adopted as the feature extractor, whose global context-aware capabilities provide semantically meaningful crowd features with a minimal memory footprint. Second, to compensate for the lack of spatial supervision, we design a feature aggregation mechanism termed the Learnable Density-Weighted Averaging module. This module dynamically re-weights local tokens according to predicted density scores, enabling the network to adaptively modulate regional features based on their specific density characteristics without the need for additional annotations. Furthermore, this paper introduces a density-level classification loss, which discretizes crowd density into distinct grades, thereby regularizing the training process and enhancing the model's discriminative power across varying levels of crowd density. As a result, although TCFormer is trained under a weakly supervised paradigm using only image-level global counts, the joint optimization of count and density-level losses enables the framework to achieve high estimation accuracy. Extensive experiments on four benchmarks (ShanghaiTech A/B, UCF-QNRF, and NWPU) demonstrate that our approach strikes a superior trade-off between parameter efficiency and counting accuracy and is a strong candidate for crowd counting on edge devices.
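A minimal sketch of what a density-weighted token-averaging head could look like under image-level supervision is given below; the dimensions, the single-linear density scorer, and the count head are illustrative assumptions rather than TCFormer's actual module.

```python
# Density-score-weighted token pooling with an image-level count loss only.
import torch
import torch.nn as nn

class DensityWeightedAveraging(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.density_score = nn.Linear(dim, 1)   # per-token density estimate
        self.count_head = nn.Linear(dim, 1)

    def forward(self, tokens):
        # tokens: (B, N, dim) features from the transformer backbone.
        w = torch.softmax(self.density_score(tokens), dim=1)   # (B, N, 1) weights
        pooled = (w * tokens).sum(dim=1)                       # density-weighted average
        return self.count_head(pooled).squeeze(-1)             # (B,) global count

model = DensityWeightedAveraging()
counts = model(torch.randn(2, 196, 256))
loss = nn.functional.mse_loss(counts, torch.tensor([35.0, 120.0]))  # image-level labels only
print(counts.shape, loss.item())
```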




Multi-view crowd counting has been proposed to deal with the severe occlusion issue of crowd counting in large and wide scenes. However, due to the difficulty of collecting and annotating multi-view images, multi-view counting datasets contain only a limited number of frames and scenes. To solve the problem of limited data, one approach is to collect synthetic data and bypass the annotation step, while another is to develop semi-supervised, weakly supervised, or unsupervised methods that demand less multi-view data. In this paper, we propose two semi-supervised multi-view crowd counting frameworks that rank multi-view fusion models with different numbers of input views in terms of either their predictions or their uncertainties. Specifically, in the first method (vanilla model), we rank the fusion models' predictions for different numbers of camera-view inputs: the count predicted with fewer camera views should not be larger than the count predicted with more views. In the second method, we rank the estimated uncertainties of the fusion models with a variable number of view inputs, guided by the models' prediction errors: the uncertainty estimated with more camera views should not be larger than that with fewer views. These ranking constraints are introduced into model training in a semi-supervised fashion for multi-view counting with limited labeled data. Experiments demonstrate the advantages of the proposed multi-view model ranking methods over other semi-supervised counting methods.
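As a concrete but illustrative rendering of the first ranking constraint, the hinge-style penalty below discourages a fused count from increasing when camera views are dropped; the uncertainty-based variant would swap predictions for estimated uncertainties with the inequality reversed. The margin and loss form are assumptions, not the paper's exact objective.

```python
# Prediction-ranking penalty usable on unlabeled frames: fewer views should not
# yield a larger predicted count than more views.
import torch

def view_ranking_loss(pred_fewer_views, pred_more_views, margin=0.0):
    # Penalise cases where dropping views *increases* the predicted count.
    return torch.relu(pred_fewer_views - pred_more_views + margin).mean()

pred_2_views = torch.tensor([48.0, 103.0])   # counts fused from 2 cameras
pred_3_views = torch.tensor([50.0, 101.0])   # counts fused from 3 cameras
print(view_ranking_loss(pred_2_views, pred_3_views))  # only the second pair is penalised
```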




With the advent of the 6G era, Integrated Sensing and Communications (ISAC) has attracted increasing attention. One representative use case is crowd flow estimation on outdoor streets. However, most existing studies have focused on indoor environments or vehicles, and demonstrations of outdoor crowd flow estimation using commercial LTE base stations remain limited. This study addresses this use case and analyzes a crowd flow estimation method using Reference Signal Received Power (RSRP) obtained from a commercial LTE base station. Specifically, pedestrian counts derived from a camera-based object recognition algorithm were associated with the variance of RSRP. The features obtained from the variance were quantitatively evaluated by combining a CatBoost regression model with SHapley Additive exPlanations (SHAP) analysis. Through this investigation, we found that the optimal variance window size for RSRP is 0.1 to 0.2 seconds and that enlarging the counting area increases the contribution of the RSRP-variance features to the machine learning model. Consequently, this study is the first to quantitatively demonstrate the effectiveness of outdoor crowd flow estimation using commercial LTE, while also revealing the characteristic behavior of the variance window size and counting area size in feature design.
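A hedged sketch of the feature pipeline implied by the abstract is given below: the variance of RSRP over short windows is used as the input feature of a CatBoost regressor predicting pedestrian counts. The sampling rate, window length, non-overlapping windows, and synthetic data are assumptions made only to keep the example self-contained.

```python
# Windowed RSRP variance as a feature for a CatBoost pedestrian-count regressor.
import numpy as np
from catboost import CatBoostRegressor

def rsrp_variance_features(rsrp, fs_hz=100, window_s=0.2):
    win = int(fs_hz * window_s)
    n = len(rsrp) // win
    # One variance value per non-overlapping window of RSRP samples.
    return rsrp[: n * win].reshape(n, win).var(axis=1, keepdims=True)

rng = np.random.default_rng(0)
rsrp = rng.normal(-95.0, 2.0, size=60_000)          # 10 minutes of synthetic RSRP (dBm) at 100 Hz
X = rsrp_variance_features(rsrp)                    # (n_windows, 1) feature matrix
y = rng.integers(0, 15, size=len(X)).astype(float)  # synthetic pedestrian counts per window

model = CatBoostRegressor(iterations=200, depth=4, verbose=0)
model.fit(X, y)
print(model.predict(X[:5]))
```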




This paper introduces Gaussian Spatial Transport (GST), a novel framework that leverages Gaussian splatting to facilitate transport from the probability measure in the image coordinate space to the annotation map. We propose a Gaussian splatting-based method to estimate pixel-annotation correspondence, which is then used to compute a transport plan derived from Bayesian probability. To integrate the resulting transport plan into standard network optimization in typical computer vision tasks, we derive a loss function that measures discrepancy after transport. Extensive experiments on representative computer vision tasks, including crowd counting and landmark detection, validate the effectiveness of our approach. Compared to conventional optimal transport schemes, GST eliminates iterative transport plan computation during training, significantly improving efficiency. Code is available at https://github.com/infinite0522/GST.
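The exact loss is not given in the abstract; as a hedged sketch of the general idea (a transport plan fixed ahead of training, so no iterative plan computation inside the loop), the code below pushes a predicted density through a precomputed plan and measures an L1 discrepancy against the target marginal. The random plan and the L1 form are illustrative stand-ins, not GST's actual construction.

```python
# Loss after transport through a fixed, precomputed plan (no Sinkhorn in the loop).
import torch

def transported_discrepancy(pred_density, plan, target_mass):
    # pred_density: (P,) flattened predicted density over pixels.
    # plan: (P, K) precomputed transport plan from pixels to K annotation points.
    # target_mass: (K,) mass each annotation point should receive (typically ones).
    received = plan.T @ pred_density            # mass arriving at each annotation point
    return torch.abs(received - target_mass).sum()

pred = torch.rand(64 * 64, requires_grad=True)
plan = torch.softmax(torch.randn(64 * 64, 10), dim=1)   # stand-in for a precomputed plan
loss = transported_discrepancy(pred, plan, torch.ones(10))
loss.backward()                                  # differentiable w.r.t. the prediction
print(loss.item())
```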