Gait recognition is a biometric technique that identifies individuals by their unique walking styles, making it suitable for unconstrained environments and a wide range of applications. While current methods focus on exploiting body part-based representations, they often neglect the hierarchical dependencies between local motion patterns. In this paper, we propose a hierarchical spatio-temporal representation learning (HSTL) framework that extracts gait features from coarse to fine. Our framework starts with a hierarchical clustering analysis to recover multi-level body structures, from the whole body down to local details. Next, an adaptive region-based motion extractor (ARME) is designed to learn region-independent motion features. HSTL then stacks multiple ARMEs in a top-down manner, each corresponding to a specific partition level of the hierarchy. An adaptive spatio-temporal pooling (ASTP) module captures gait features at different levels of detail to perform hierarchical feature mapping. Finally, a frame-level temporal aggregation (FTA) module reduces redundant information in gait sequences through multi-scale temporal downsampling. Extensive experiments on the CASIA-B, OUMVLP, GREW, and Gait3D datasets demonstrate that our method outperforms the state of the art while maintaining a reasonable balance between accuracy and model complexity.
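The coarse-to-fine idea above can be illustrated with a minimal sketch. Note the simplifications: HSTL derives its multi-level partitions from hierarchical clustering of body parts, whereas the fixed pyramid of 1, 2, and 4 horizontal strips below (in the spirit of common part-based gait pipelines) is purely for demonstration, and mean pooling stands in for the adaptive spatio-temporal pooling.

```python
# Illustrative coarse-to-fine pooling over horizontal body strips.
# Assumption: fixed equal-height strips and plain mean pooling; the
# actual HSTL partitions are learned and its pooling is adaptive.

def strip_pool(feature_map, num_strips):
    """Mean-pool each of `num_strips` equal horizontal strips.

    feature_map: list of H rows, each a list of W floats.
    Returns one pooled value per strip (coarse body-region feature).
    """
    h = len(feature_map)
    strip_h = h // num_strips
    pooled = []
    for s in range(num_strips):
        rows = feature_map[s * strip_h:(s + 1) * strip_h]
        vals = [v for row in rows for v in row]
        pooled.append(sum(vals) / len(vals))
    return pooled

def hierarchical_pool(feature_map, levels=(1, 2, 4)):
    """Concatenate pooled features from coarse (whole body) to fine."""
    feats = []
    for n in levels:
        feats.extend(strip_pool(feature_map, n))
    return feats

# Toy 8x2 "feature map" whose values increase down the body.
fm = [[float(r), float(r)] for r in range(8)]
print(hierarchical_pool(fm))  # 1 + 2 + 4 = 7 pooled values
```

Each level summarizes the same frame at a different spatial granularity, so the concatenated descriptor carries both whole-body shape and local-part detail, mirroring the top-down stacking of ARMEs.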
Gait recognition aims to identify individual-specific walking patterns, which depends heavily on observing the distinct periodic movements of each body part. However, most existing methods treat each part equally and neglect the data redundancy caused by the high sampling rate of gait sequences. In this work, we propose a fine-grained motion representation network (GaitFM) that improves gait recognition in three ways. First, a fine-grained part sequence learning (FPSL) module is designed to explore part-independent spatio-temporal representations. Second, a frame-wise compression strategy, called local motion aggregation (LMA), is used to enhance motion variations. Finally, a weighted generalized mean pooling (WGeM) layer adaptively retains more discriminative information during spatial downsampling. Experiments on two public datasets, CASIA-B and OUMVLP, show that our approach achieves state-of-the-art performance. On the CASIA-B dataset, our method achieves rank-1 accuracies of 98.0%, 95.7%, and 87.9% for normal walking, walking with a bag, and walking with a coat, respectively. On the OUMVLP dataset, it achieves a rank-1 accuracy of 90.5%.
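To make the pooling idea concrete, here is a minimal sketch of (weighted) generalized-mean pooling. The caveat: in GaitFM the WGeM weights are learned end to end, whereas the uniform weights below are a hand-set stand-in; with uniform weights this reduces to standard GeM pooling, which interpolates between average pooling (p → 1) and max pooling (p → ∞).

```python
# Illustrative (weighted) generalized-mean pooling.
# Assumption: uniform weights for demonstration; WGeM learns the
# per-position weights during training.

def gem_pool(values, p=3.0, weights=None):
    """Generalized mean: (sum_i w_i * x_i**p) ** (1/p).

    Larger p pushes the result toward the maximum, so the most
    discriminative (largest) activations dominate the pooled value.
    Expects non-negative activations, e.g. post-ReLU features.
    """
    n = len(values)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weights -> standard GeM
    s = sum(w * (x ** p) for w, x in zip(weights, values))
    return s ** (1.0 / p)

acts = [0.1, 0.2, 0.9, 0.3]
avg = sum(acts) / len(acts)      # plain average pooling: 0.375
gem = gem_pool(acts, p=3.0)      # lies between the mean and the max
assert avg < gem < max(acts)
```

Because p controls how sharply the pooled value tracks the peak activation, a learned, spatially varying weighting lets the layer keep more discriminative detail than uniform average pooling during spatial downsampling.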