Topic:Skeleton Based Action Recognition
What is Skeleton Based Action Recognition? Skeleton-based Action Recognition is a computer-vision task that involves recognizing human actions from a sequence of 3D skeletal joint data captured from sensors such as Microsoft Kinect, Intel RealSense, and wearable devices. The goal of skeleton-based action recognition is to develop algorithms that can understand and classify human actions from skeleton data, which can be used in various applications such as human-computer interaction, sports analysis, and surveillance.
Papers and Code
Apr 28, 2024
Abstract:Skeleton-based action recognition is vital for comprehending human-centric videos and has applications in diverse domains. One of the challenges of skeleton-based action recognition is dealing with low-quality data, such as skeletons that have missing or inaccurate joints. This paper addresses the issue of enhancing action recognition using low-quality skeletons through a general knowledge distillation framework. The proposed framework employs a teacher-student model setup, where a teacher model trained on high-quality skeletons guides the learning of a student model that handles low-quality skeletons. To bridge the gap between heterogeneous high-quality and lowquality skeletons, we present a novel part-based skeleton matching strategy, which exploits shared body parts to facilitate local action pattern learning. An action-specific part matrix is developed to emphasize critical parts for different actions, enabling the student model to distill discriminative part-level knowledge. A novel part-level multi-sample contrastive loss achieves knowledge transfer from multiple high-quality skeletons to low-quality ones, which enables the proposed knowledge distillation framework to include training low-quality skeletons that lack corresponding high-quality matches. Comprehensive experiments conducted on the NTU-RGB+D, Penn Action, and SYSU 3D HOI datasets demonstrate the effectiveness of the proposed knowledge distillation framework.
* published in Signal Processing 2024
Via

May 31, 2024
Abstract:Human action recognition is a crucial task in computer vision systems. However, in real-world scenarios, human actions often fall outside the distribution of training data, requiring a model to both recognize in-distribution (ID) actions and reject out-of-distribution (OOD) ones. Despite its importance, there has been limited research on OOD detection in human actions. Existing works on OOD detection mainly focus on image data with RGB structure, and many methods are post-hoc in nature. While these methods are convenient and computationally efficient, they often lack sufficient accuracy and fail to consider the presence of OOD samples. To address these challenges, we propose a novel end-to-end skeleton-based model called Action-OOD, specifically designed for OOD human action detection. Unlike some existing approaches that may require prior knowledge of existing OOD data distribution, our model solely utilizes in-distribution (ID) data during the training stage, effectively mitigating the overconfidence issue prevalent in OOD detection. We introduce an attention-based feature fusion block, which enhances the model's capability to recognize unknown classes while preserving classification accuracy for known classes. Further, we present a novel energy-based loss function and successfully integrate it with the traditional cross-entropy loss to maximize the separation of data distributions between ID and OOD. Through extensive experiments conducted on NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics-400 datasets, we demonstrate the superior performance of our proposed approach compared to state-of-the-art methods. Our findings underscore the effectiveness of classic OOD detection techniques in the context of skeleton-based action recognition tasks, offering promising avenues for future research in this field. Code will be available at: https://github.com/YilliaJing/Action-OOD.git.
* Under consideration at Computer Vision and Image Understanding
Via

Jun 30, 2024
Abstract:Existing Graph Neural Networks (GNNs) are limited to process graphs each of whose vertices is represented by a vector or a single value, limited their representing capability to describe complex objects. In this paper, we propose the first GNN (called Graph in Graph Neural (GIG) Network) which can process graph-style data (called GIG sample) whose vertices are further represented by graphs. Given a set of graphs or a data sample whose components can be represented by a set of graphs (called multi-graph data sample), our GIG network starts with a GIG sample generation (GSG) module which encodes the input as a \textbf{GIG sample}, where each GIG vertex includes a graph. Then, a set of GIG hidden layers are stacked, with each consisting of: (1) a GIG vertex-level updating (GVU) module that individually updates the graph in every GIG vertex based on its internal information; and (2) a global-level GIG sample updating (GGU) module that updates graphs in all GIG vertices based on their relationships, making the updated GIG vertices become global context-aware. This way, both internal cues within the graph contained in each GIG vertex and the relationships among GIG vertices could be utilized for down-stream tasks. Experimental results demonstrate that our GIG network generalizes well for not only various generic graph analysis tasks but also real-world multi-graph data analysis (e.g., human skeleton video-based action recognition), which achieved the new state-of-the-art results on 13 out of 14 evaluated datasets. Our code is publicly available at https://github.com/wangjs96/Graph-in-Graph-Neural-Network.
Via

May 31, 2024
Abstract:Supervised and self-supervised learning are two main training paradigms for skeleton-based human action recognition. However, the former one-hot classification requires labor-intensive predefined action categories annotations, while the latter involves skeleton transformations (e.g., cropping) in the pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C$^2$VL) based on Cross-modal Contrastive learning that uses the progressive distillation to learn task-agnostic human skeleton action representation from the Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose the intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal contrastive process to progressively control and guide the degree of pulling vision-language knowledge prompts and corresponding skeletons closer. These soft instance discrimination and self-knowledge distillation strategies contribute to the learning of better skeleton-based action representations from the noisy skeleton-vision-language pairs. During the inference phase, our method requires only the skeleton data as the input for action recognition and no longer for vision-language prompts. Extensive experiments show that our method achieves state-of-the-art results on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The code will be available in the future.
Via

Apr 30, 2024
Abstract:Human action video recognition has recently attracted more attention in applications such as video security and sports posture correction. Popular solutions, including graph convolutional networks (GCNs) that model the human skeleton as a spatiotemporal graph, have proven very effective. GCNs-based methods with stacked blocks usually utilize top-layer semantics for classification/annotation purposes. Although the global features learned through the procedure are suitable for the general classification, they have difficulty capturing fine-grained action change across adjacent frames -- decisive factors in sports actions. In this paper, we propose a novel ``Cross-block Fine-grained Semantic Cascade (CFSC)'' module to overcome this challenge. In summary, the proposed CFSC progressively integrates shallow visual knowledge into high-level blocks to allow networks to focus on action details. In particular, the CFSC module utilizes the GCN feature maps produced at different levels, as well as aggregated features from proceeding levels to consolidate fine-grained features. In addition, a dedicated temporal convolution is applied at each level to learn short-term temporal features, which will be carried over from shallow to deep layers to maximize the leverage of low-level details. This cross-block feature aggregation methodology, capable of mitigating the loss of fine-grained information, has resulted in improved performance. Last, FD-7, a new action recognition dataset for fencing sports, was collected and will be made publicly available. Experimental results and empirical analysis on public benchmarks (FSD-10) and self-collected (FD-7) demonstrate the advantage of our CFSC module on learning discriminative patterns for action classification over others.
Via

Mar 15, 2024
Abstract:Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences is time-consuming and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model's training, resulting in lower recognition quality. Despite its importance, addressing label noise for skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal performance when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MOE) strategies, aimed at mitigating the adverse impacts of label noise. Our proposed approach demonstrates better performance on the established benchmark, setting new state-of-the-art standards. The source code for this study will be made accessible at https://github.com/xuyizdby/NoiseEraSAR.
* The source code will be made accessible at
https://github.com/xuyizdby/NoiseEraSAR
Via

Apr 15, 2024
Abstract:Skeleton-based zero-shot action recognition aims to recognize unknown human actions based on the learned priors of the known skeleton-based actions and a semantic descriptor space shared by both known and unknown categories. However, previous works focus on establishing the bridges between the known skeleton representation space and semantic descriptions space at the coarse-grained level for recognizing unknown action categories, ignoring the fine-grained alignment of these two spaces, resulting in suboptimal performance in distinguishing high-similarity action categories. To address these challenges, we propose a novel method via Side information and dual-prompts learning for skeleton-based zero-shot action recognition (STAR) at the fine-grained level. Specifically, 1) we decompose the skeleton into several parts based on its topology structure and introduce the side information concerning multi-part descriptions of human body movements for alignment between the skeleton and the semantic space at the fine-grained level; 2) we design the visual-attribute and semantic-part prompts to improve the intra-class compactness within the skeleton space and inter-class separability within the semantic space, respectively, to distinguish the high-similarity actions. Extensive experiments show that our method achieves state-of-the-art performance in ZSL and GZSL settings on NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.
* 11 pages, 5 figures
Via

Mar 31, 2024
Abstract:Skeleton-based action recognition has attracted lots of research attention. Recently, to build an accurate skeleton-based action recognizer, a variety of works have been proposed. Among them, some works use large model architectures as backbones of their recognizers to boost the skeleton data representation capability, while some other works pre-train their recognizers on external data to enrich the knowledge. In this work, we observe that large language models which have been extensively used in various natural language processing tasks generally hold both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the Large Language Model as an Action Recognizer. In our framework, we propose a linguistic projection process to project each input action signal (i.e., each skeleton sequence) into its ``sentence format'' (i.e., an ``action sentence''). Moreover, we also incorporate our framework with several designs to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework.
* CVPR 2024
Via

Sep 15, 2024
Abstract:Human Activity Recognition (HAR) systems aim to understand human behaviour and assign a label to each action, attracting significant attention in computer vision due to their wide range of applications. HAR can leverage various data modalities, such as RGB images and video, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, and radar signals. Each modality provides unique and complementary information suited to different application scenarios. Consequently, numerous studies have investigated diverse approaches for HAR using these modalities. This paper presents a comprehensive survey of the latest advancements in HAR from 2014 to 2024, focusing on machine learning (ML) and deep learning (DL) approaches categorized by input data modalities. We review both single-modality and multi-modality techniques, highlighting fusion-based and co-learning frameworks. Additionally, we cover advancements in hand-crafted action features, methods for recognizing human-object interactions, and activity detection. Our survey includes a detailed dataset description for each modality and a summary of the latest HAR systems, offering comparative results on benchmark datasets. Finally, we provide insightful observations and propose effective future research directions in HAR.
Via

Feb 03, 2024
Abstract:Skeleton-based action recognition has attracted much attention, benefiting from its succinctness and robustness. However, the minimal inter-class variation in similar action sequences often leads to confusion. The inherent spatiotemporal coupling characteristics make it challenging to mine the subtle differences in joint motion trajectories, which is critical for distinguishing confusing fine-grained actions. To alleviate this problem, we propose a Wavelet-Attention Decoupling (WAD) module that utilizes discrete wavelet transform to effectively disentangle salient and subtle motion features in the time-frequency domain. Then, the decoupling attention adaptively recalibrates their temporal responses. To further amplify the discrepancies in these subtle motion features, we propose a Fine-grained Contrastive Enhancement (FCE) module to enhance attention towards trajectory features by contrastive learning. Extensive experiments are conducted on the coarse-grained dataset NTU RGB+D and the fine-grained dataset FineGYM. Our methods perform competitively compared to state-of-the-art methods and can discriminate confusing fine-grained actions well.
* Accepted by ICASSP 2024
Via
