Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yazhou Yao

Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

Apr 14, 2025

Hongyu Qu, Ling Xing, Rui Yan, Yazhou Yao, Guo-Sen Xie, Xiangbo Shu

Abstract:Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations independently for each video by designing various inter-frame temporal modeling strategies. However, they neglect explicit relation modeling between videos and tasks, thus failing to capture shared temporal patterns across videos and reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. In addition to conducting inter-frame temporal interactions, we further devise two components to respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and learning intra- and inter-class temporal correlations among support features; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from the bank, which stores diverse temporal patterns from historical tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.

Via

Access Paper or Ask Questions

Efficient Token Compression for Vision Transformer with Spatial Information Preserved

Mar 30, 2025

Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua

Figure 1 for Efficient Token Compression for Vision Transformer with Spatial Information Preserved

Figure 2 for Efficient Token Compression for Vision Transformer with Spatial Information Preserved

Figure 3 for Efficient Token Compression for Vision Transformer with Spatial Information Preserved

Figure 4 for Efficient Token Compression for Vision Transformer with Spatial Information Preserved

Abstract:Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, achieving significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64$\times$ speed-up with only a 0.2\% drop in accuracy on ImageNet-1k. Moreover, by compressing segmenter models and comparing with existing methods, we demonstrate the superior performance of our approach in terms of efficiency and effectiveness. Code and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/prune_and_merge.

* accepted by IEEE Transactions on Multimedia

Via

Access Paper or Ask Questions

Twofold Debiasing Enhances Fine-Grained Learning with Coarse Labels

Feb 27, 2025

Xin-yang Zhao, Jian Jin, Yang-yang Li, Yazhou Yao

Figure 1 for Twofold Debiasing Enhances Fine-Grained Learning with Coarse Labels

Figure 2 for Twofold Debiasing Enhances Fine-Grained Learning with Coarse Labels

Figure 3 for Twofold Debiasing Enhances Fine-Grained Learning with Coarse Labels

Figure 4 for Twofold Debiasing Enhances Fine-Grained Learning with Coarse Labels

Abstract:The Coarse-to-Fine Few-Shot (C2FS) task is designed to train models using only coarse labels, then leverages a limited number of subclass samples to achieve fine-grained recognition capabilities. This task presents two main challenges: coarse-grained supervised pre-training suppresses the extraction of critical fine-grained features for subcategory discrimination, and models suffer from overfitting due to biased distributions caused by limited fine-grained samples. In this paper, we propose the Twofold Debiasing (TFB) method, which addresses these challenges through detailed feature enhancement and distribution calibration. Specifically, we introduce a multi-layer feature fusion reconstruction module and an intermediate layer feature alignment module to combat the model's tendency to focus on simple predictive features directly related to coarse-grained supervision, while neglecting complex fine-grained level details. Furthermore, we mitigate the biased distributions learned by the fine-grained classifier using readily available coarse-grained sample embeddings enriched with fine-grained information. Extensive experiments conducted on five benchmark datasets demonstrate the efficacy of our approach, achieving state-of-the-art results that surpass competitive methods.

Via

Access Paper or Ask Questions

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Dec 25, 2024

Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, Errui Ding

Figure 1 for Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Figure 2 for Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Figure 3 for Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Figure 4 for Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Abstract:Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we explore the idea that CV adopts discrete and terminological task definitions (\eg, ``image segmentation''), which may be a key barrier to zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks--due to these terminological definitions--deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million ``image input $\to$ explanatory instruction $\to$ output'' triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be openly available on our GitHub repository.

* 41 pages

Via

Access Paper or Ask Questions

The Key of Understanding Vision Tasks: Explanatory Instructions

Dec 24, 2024

Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, Errui Ding

Figure 1 for The Key of Understanding Vision Tasks: Explanatory Instructions

Figure 2 for The Key of Understanding Vision Tasks: Explanatory Instructions

Figure 3 for The Key of Understanding Vision Tasks: Explanatory Instructions

Figure 4 for The Key of Understanding Vision Tasks: Explanatory Instructions

* 40 pages

Via

Access Paper or Ask Questions

FTMoMamba: Motion Generation with Frequency and Text State Space Models

Nov 26, 2024

Chengjian Li, Xiangbo Shu, Qiongjie Cui, Yazhou Yao, Jinhui Tang

Abstract:Diffusion models achieve impressive performance in human motion generation. However, current approaches typically ignore the significance of frequency-domain information in capturing fine-grained motions within the latent space (e.g., low frequencies correlate with static poses, and high frequencies align with fine-grained motions). Additionally, there is a semantic discrepancy between text and motion, leading to inconsistency between the generated motions and the text descriptions. In this work, we propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM). Specifically, to learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components, guiding the generation of static pose (e.g., sits, lay) and fine-grained motions (e.g., transition, stumble), respectively. To ensure the consistency between text and motion, TextSSM encodes text features at the sentence level, aligning textual semantics with sequential features. Extensive experiments show that FTMoMamba achieves superior performance on the text-to-motion generation task, especially gaining the lowest FID of 0.181 (rather lower than 0.421 of MLD) on the HumanML3D dataset.

* 8 pages, 6 figures

Via

Access Paper or Ask Questions

UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Nov 25, 2024

Guangzhao Dai, Jian Zhao, Yuantao Chen, Yusen Qin, Hao Zhao, Guosen Xie, Yazhou Yao, Xiangbo Shu, Xuelong Li

Figure 1 for UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Figure 2 for UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Figure 3 for UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Figure 4 for UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Abstract:Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a target destination, has recently seen significant advancements. In contrast to navigation in discrete environments with predefined trajectories, VLN in Continuous Environments (VLN-CE) presents greater challenges, as the agent is free to navigate any unobstructed location and is more vulnerable to visual occlusions or blind spots. Recent approaches have attempted to address this by imagining future environments, either through predicted future visual images or semantic features, rather than relying solely on current observations. However, these RGB-based and feature-based methods lack intuitive appearance-level information or high-level semantic complexity crucial for effective navigation. To overcome these limitations, we introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN, which enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features. UnitedVLN employs two key schemes: search-then-query sampling and separate-then-united rendering, which facilitate efficient exploitation of neural primitives, helping to integrate both appearance and semantic information for more robust navigation. Extensive experiments demonstrate that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.

Via

Access Paper or Ask Questions

COMOGen: A Controllable Text-to-3D Multi-object Generation Framework

Sep 01, 2024

Shaorong Sun, Shuchao Pang, Yazhou Yao, Xiaoshui Huang

Figure 1 for COMOGen: A Controllable Text-to-3D Multi-object Generation Framework

Figure 2 for COMOGen: A Controllable Text-to-3D Multi-object Generation Framework

Figure 3 for COMOGen: A Controllable Text-to-3D Multi-object Generation Framework

Figure 4 for COMOGen: A Controllable Text-to-3D Multi-object Generation Framework

Abstract:The controllability of 3D object generation methods is achieved through input text. Existing text-to-3D object generation methods primarily focus on generating a single object based on a single object description. However, these methods often face challenges in producing results that accurately correspond to our desired positions when the input text involves multiple objects. To address the issue of controllability in generating multiple objects, this paper introduces COMOGen, a COntrollable text-to-3D Multi-Object Generation framework. COMOGen enables the simultaneous generation of multiple 3D objects by the distillation of layout and multi-view prior knowledge. The framework consists of three modules: the layout control module, the multi-view consistency control module, and the 3D content enhancement module. Moreover, to integrate these three modules as an integral framework, we propose Layout Multi-view Score Distillation, which unifies two prior knowledge and further enhances the diversity and quality of generated 3D content. Comprehensive experiments demonstrate the effectiveness of our approach compared to the state-of-the-art methods, which represents a significant step forward in enabling more controlled and versatile text-based 3D content generation.

Via

Access Paper or Ask Questions

Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation

Jul 03, 2024

Tao Chen, XiRuo Jiang, Gensheng Pei, Zeren Sun, Yucheng Wang, Yazhou Yao

Figure 1 for Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation

Figure 2 for Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation

Figure 3 for Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation

Figure 4 for Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation

Abstract:Though adversarial erasing has prevailed in weakly supervised semantic segmentation to help activate integral object regions, existing approaches still suffer from the dilemma of under-activation and over-expansion due to the difficulty in determining when to stop erasing. In this paper, we propose a \textbf{K}nowledge \textbf{T}ransfer with \textbf{S}imulated Inter-Image \textbf{E}rasing (KTSE) approach for weakly supervised semantic segmentation to alleviate the above problem. In contrast to existing erasing-based methods that remove the discriminative part for more object discovery, we propose a simulated inter-image erasing scenario to weaken the original activation by introducing extra object information. Then, object knowledge is transferred from the anchor image to the consequent less activated localization map to strengthen network localization ability. Considering the adopted bidirectional alignment will also weaken the anchor image activation if appropriate constraints are missing, we propose a self-supervised regularization module to maintain the reliable activation in discriminative regions and improve the inter-class object boundary recognition for complex images with multiple categories of objects. In addition, we resort to intra-image erasing and propose a multi-granularity alignment module to gently enlarge the object activation to boost the object knowledge transfer. Extensive experiments and ablation studies on PASCAL VOC 2012 and COCO datasets demonstrate the superiority of our proposed approach. Source codes and models are available at https://github.com/NUST-Machine-Intelligence-Laboratory/KTSE.

* accepted by the European Conference on Computer Vision (ECCV), 2024

Via

Access Paper or Ask Questions

Anti-Collapse Loss for Deep Metric Learning Based on Coding Rate Metric

Jul 03, 2024

Xiruo Jiang, Yazhou Yao, Xili Dai, Fumin Shen, Xian-Sheng Hua, Heng-Tao Shen

Abstract:Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer from the collapse of the embedding space due to their over-reliance on label information. This leads to sub-optimal feature representation and inferior model performance. To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss. Specifically, our proposed loss primarily draws inspiration from the principle of Maximal Coding Rate Reduction. It promotes the sparseness of feature clusters in the embedding space to prevent collapse by maximizing the average coding rate of sample features or class proxies. Moreover, we integrate our proposed loss with pair-based and proxy-based methods, resulting in notable performance improvement. Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. Extensive ablation studies verify the effectiveness of our method in preventing embedding space collapse and promoting generalization performance.

* accepted by IEEE Transactions on Multimedia

Via

Access Paper or Ask Questions