



With the continuous development of deep learning, the field of repetitive action counting is gradually gaining notice from many researchers. Extraction of pose keypoints using human pose estimation networks is proven to be an effective pose-level method. However, existing pose-level methods suffer from the shortcomings that the single coordinate is not stable enough to handle action distortions due to changes in camera viewpoints, thus failing to accurately identify salient poses, and is vulnerable to misdetection during the transition from the exception to the actual action. To overcome these problems, we propose a simple but efficient Global Multi-geometric Feature Learning Network (GMFL-Net). Specifically, we design a MIA-Module that aims to improve information representation by fusing multi-geometric features, and learning the semantic similarity among the input multi-geometric features. Then, to improve the feature representation from a global perspective, we also design a GBFL-Module that enhances the inter-dependencies between point-wise and channel-wise elements and combines them with the rich local information generated by the MIA-Module to synthesise a comprehensive and most representative global feature representation. In addition, considering the insufficient existing dataset, we collect a new dataset called Countix-Fitness-pose (https://github.com/Wantong66/Countix-Fitness) which contains different cycle lengths and exceptions, a test set with longer duration, and annotate it with fine-grained annotations at the pose-level. We also add two new action classes, namely lunge and rope push-down. Finally, extensive experiments on the challenging RepCount-pose, UCFRep-pose, and Countix-Fitness-pose benchmarks show that our proposed GMFL-Net achieves state-of-the-art performance.




Counting repetitive actions in long untrimmed videos is a challenging task that has many applications such as rehabilitation. State-of-the-art methods predict action counts by first generating a temporal self-similarity matrix (TSM) from the sampled frames and then feeding the matrix to a predictor network. The self-similarity matrix, however, is not an optimal input to a network since it discards too much information from the frame-wise embeddings. We thus rethink how a TSM can be utilized for counting repetitive actions and propose a framework that learns embeddings and predicts action start probabilities at full temporal resolution. The number of repeated actions is then inferred from the action start probabilities. In contrast to current approaches that have the TSM as an intermediate representation, we propose a novel loss based on a generated reference TSM, which enforces that the self-similarity of the learned frame-wise embeddings is consistent with the self-similarity of repeated actions. The proposed framework achieves state-of-the-art results on three datasets, i.e., RepCount, UCFRep, and Countix.




Repetitive action counting quantifies the frequency of specific actions performed by individuals. However, existing action-counting datasets have limited action diversity, potentially hampering model performance on unseen actions. To address this issue, we propose a framework called First Cycle Annotated Repetitive Action Counting (FCA-RAC). This framework contains 4 parts: 1) a labeling technique that annotates each training video with the start and end of the first action cycle, along with the total action count. This technique enables the model to capture the correlation between the initial action cycle and subsequent actions; 2) an adaptive sampling strategy that maximizes action information retention by adjusting to the speed of the first annotated action cycle in videos; 3) a Multi-Temporal Granularity Convolution (MTGC) module, that leverages the muli-scale first action as a kernel to convolve across the entire video. This enables the model to capture action variations at different time scales within the video; 4) a strategy called Training Knowledge Augmentation (TKA) that exploits the annotated first action cycle information from the entire dataset. This allows the network to harness shared characteristics across actions effectively, thereby enhancing model performance and generalizability to unseen actions. Experimental results demonstrate that our approach achieves superior outcomes on RepCount-A and related datasets, highlighting the efficacy of our framework in improving model performance on seen and unseen actions. Our paper makes significant contributions to the field of action counting by addressing the limitations of existing datasets and proposing novel techniques for improving model generalizability.




The key to action counting is accurately locating each video's repetitive actions. Instead of estimating the probability of each frame belonging to an action directly, we propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner. The model draws inspiration from empirical observations indicating that humans typically engage in coarse skimming of entire sequences to grasp the general action pattern initially, followed by a finer, frame-by-frame focus to determine if it aligns with the target action. Specifically, SkimFocusNet incorporates a skim branch and a focus branch. The skim branch scans the global contextual information throughout the sequence to identify potential target action for guidance. Subsequently, the focus branch utilizes the guidance to diligently identify repetitive actions using a long-short adaptive guidance (LSAG) block. Additionally, we have observed that videos in existing datasets often feature only one type of repetitive action, which inadequately represents real-world scenarios. To more accurately describe real-life situations, we establish the Multi-RepCount dataset, which includes videos containing multiple repetitive motions. On Multi-RepCount, our SkimFoucsNet can perform specified action counting, that is, to enable counting a particular action type by referencing an exemplary video. This capability substantially exhibits the robustness of our method. Extensive experiments demonstrate that SkimFocusNet achieves state-of-the-art performances with significant improvements. We also conduct a thorough ablation study to evaluate the network components. The source code will be published upon acceptance.




Video repetition counting infers the number of repetitions of recurring actions or motion within a video. We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos. Our proposed Every Shot Counts (ESCounts) model is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same and different videos. In training, ESCounts regresses locations of high correspondence to the exemplars within the video. In tandem, our method learns a latent that encodes representations of general repetitive motions, which we use for exemplar-free, zero-shot inference. Extensive experiments over commonly used datasets (RepCount, Countix, and UCFRep) showcase ESCounts obtaining state-of-the-art performance across all three datasets. On RepCount, ESCounts increases the off-by-one from 0.39 to 0.56 and decreases the mean absolute error from 0.38 to 0.21. Detailed ablations further demonstrate the effectiveness of our method.
Video Action Counting (VAC) is crucial in analyzing sports, fitness, and everyday activities by quantifying repetitive actions in videos. However, traditional VAC methods have overlooked the complexity of action repetitions, such as interruptions and the variability in cycle duration. Our research addresses the shortfall by introducing a novel approach to VAC, called Irregular Video Action Counting (IVAC). IVAC prioritizes modeling irregular repetition patterns in videos, which we define through two primary aspects: Inter-cycle Consistency and Cycle-interval Inconsistency. Inter-cycle Consistency ensures homogeneity in the spatial-temporal representations of cycle segments, signifying action uniformity within cycles. Cycle-interval inconsistency highlights the importance of distinguishing between cycle segments and intervals based on their inherent content differences. To encapsulate these principles, we propose a new methodology that includes consistency and inconsistency modules, supported by a unique pull-push loss (P2L) mechanism. The IVAC-P2L model applies a pull loss to promote coherence among cycle segment features and a push loss to clearly distinguish features of cycle segments from interval segments. Empirical evaluations conducted on the RepCount dataset demonstrate that the IVAC-P2L model sets a new benchmark in VAC task performance. Furthermore, the model demonstrates exceptional adaptability and generalization across various video contents, outperforming existing models on two additional datasets, UCFRep and Countix, without the need for dataset-specific optimization. These results confirm the efficacy of our approach in addressing irregular repetitions in videos and pave the way for further advancements in video analysis and understanding.




Repetitive counting (RepCount) is critical in various applications, such as fitness tracking and rehabilitation. Previous methods have relied on the estimation of red-green-and-blue (RGB) frames and body pose landmarks to identify the number of action repetitions, but these methods suffer from a number of issues, including the inability to stably handle changes in camera viewpoints, over-counting, under-counting, difficulty in distinguishing between sub-actions, inaccuracy in recognizing salient poses, etc. In this paper, based on the work done by [1], we integrate joint angles with body pose landmarks to address these challenges and achieve better results than the state-of-the-art RepCount methods, with a Mean Absolute Error (MAE) of 0.211 and an Off-By-One (OBO) counting accuracy of 0.599 on the RepCount data set [2]. Comprehensive experimental results demonstrate the effectiveness and robustness of our method.