Shuhui Wang

Key Lab of Intell. Info. Process., Inst. of Comput. Tech., Chinese Academy of Sciences

Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

Mar 17, 2026

TianTian Dang, Chao Bi, Shufan Shen, Jinzhe Liu, Qingming Huang, Shuhui Wang

Abstract:Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance.

* Accepted by CVPR 2026
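
The layerwise idea in the abstract above (turning per-layer attribution scores into per-layer steering intensities) can be illustrated with a minimal sketch. The abstract does not state the actual conversion rule, so the top-k selection, the `keep_ratio` and `base_intensity` parameters, and the function name are all assumptions made for illustration, not the LTS-FS implementation.

```python
def layer_steering_intensities(scores, base_intensity=1.0, keep_ratio=0.5):
    """Map per-layer attribution scores to sparse per-layer steering intensities.

    Hypothetical rule: keep the most hallucination-relevant layers, scale their
    intensity by relative score, and leave every other layer unsteered (0.0).
    """
    if not scores:
        return []
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Layer indices ordered from most to least hallucination-relevant.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    top = scores[ranked[0]]
    if top <= 0:
        # No layer shows positive relevance, so nothing is steered.
        return [0.0] * len(scores)
    return [
        base_intensity * max(0.0, scores[i]) / top if i in kept else 0.0
        for i in range(len(scores))
    ]


# Four layers with made-up attribution scores; only the top half is steered.
intensities = layer_steering_intensities([0.1, 0.8, 0.4, 0.05])
```

In this toy setting only the two highest-scoring layers receive a non-zero intensity, which mirrors the stated goal of leaving hallucination-unrelated layers undisturbed.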

Divide-and-Conquer: Tree-structured Strategy with Answer Distribution Estimator for Goal-Oriented Visual Dialogue

Feb 09, 2025

Shuo Cai, Xinzhe Han, Shuhui Wang

Abstract:Goal-oriented visual dialogue involves multi-round interaction between artificial agents and has attracted considerable attention due to its wide applications. Given a visual scene, the task arises when a Questioner asks an action-oriented question and an Answerer responds with the intent of letting the Questioner know the correct action to take. The quality of the questions affects the accuracy and efficiency of the target search process. However, existing methods lack a clear strategy to guide question generation, resulting in randomness in the search process and non-convergent results. We propose a Tree-Structured Strategy with Answer Distribution Estimator (TSADE), which guides question generation by excluding half of the current candidate objects in each round. This process is implemented by maximizing a binary reward inspired by the "divide-and-conquer" paradigm. We further design a candidate-minimization reward, which encourages the model to narrow down the scope of candidate objects toward the end of the dialogue. We experimentally demonstrate that our method enables the agents to achieve high task-oriented accuracy with fewer repeated questions and rounds compared to traditional ergodic question generation approaches. Qualitative results further show that TSADE helps agents generate higher-quality questions.
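
The "exclude half of the candidates per round" strategy can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the reward precisely, so the `tolerance` parameter, the 0.5 decision threshold on the estimated "yes" probabilities, and both function names are hypothetical.

```python
def halving_reward(yes_probs, tolerance=0.1):
    """Binary reward: 1 if a question splits the candidates roughly in half.

    yes_probs[i] is an estimated probability that the answer is "yes" when
    candidate i is the target (a stand-in for an answer distribution estimator).
    """
    if not yes_probs:
        return 0
    yes_fraction = sum(1 for p in yes_probs if p >= 0.5) / len(yes_probs)
    return 1 if abs(yes_fraction - 0.5) <= tolerance else 0


def rounds_to_single_candidate(n_candidates):
    """Rounds needed if every question excludes half of the remaining candidates."""
    rounds = 0
    while n_candidates > 1:
        n_candidates = (n_candidates + 1) // 2  # worst case keeps the larger half
        rounds += 1
    return rounds


balanced = halving_reward([0.9, 0.8, 0.1, 0.2])    # splits 2 vs. 2
unbalanced = halving_reward([0.9, 0.9, 0.9, 0.1])  # splits 3 vs. 1
```

With perfect halving, 16 candidates are reduced to one in 4 rounds, which is why a well-balanced question is worth more than one that eliminates a single object.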

Expanding Sparse Tuning for Low Memory Usage

Nov 04, 2024

Shufan Shen, Junshu Sun, Xiangyang Ji, Qingming Huang, Shuhui Wang

Abstract:Parameter-efficient fine-tuning (PEFT) is an effective method for adapting pre-trained vision models to downstream tasks by tuning a small subset of parameters. Among PEFT methods, sparse tuning achieves superior performance by only adjusting the weights most relevant to downstream tasks, rather than densely tuning the whole weight matrix. However, this performance improvement has been accompanied by increased memory usage, which stems from two factors: the storage of the whole weight matrix as learnable parameters in the optimizer, and the additional storage of tunable weight indexes. In this paper, we propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage. To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices, avoiding the costly storage of the whole original matrix. A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes. To maintain the effectiveness of sparse tuning with low-rank matrices, we extend the low-rank decomposition by applying nonlinear kernel functions to the whole-matrix merging. Consequently, we gain an increase in the rank of the merged matrix, enhancing the ability of SNELL to adapt the pre-trained models to downstream tasks. Extensive experiments on multiple downstream tasks show that SNELL achieves state-of-the-art performance with low memory usage, making PEFT with sparse tuning applicable to large-scale models. Code is available at https://github.com/ssfgunner/SNELL.

* Accepted by NeurIPS 2024
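
A rough sketch of the two ingredients named in the abstract above, a kernelized merge of two low-rank factors and an index-free sparsification, is given below. It is not the SNELL code: the polynomial kernel, the magnitude-threshold rule standing in for "competition", and all function names are assumptions chosen for illustration.

```python
def poly_kernel(a, b, degree=2):
    """Polynomial kernel (a . b) ** degree; the kernel choice is an assumption."""
    return sum(x * y for x, y in zip(a, b)) ** degree


def kernel_merge(A, B, kernel=poly_kernel):
    """Merge low-rank factors into a full update: delta[i][j] = kernel(A[i], B[j]).

    A has one length-r row per output unit, B one length-r row per input unit.
    With a linear kernel this reduces to the usual LoRA product A @ B^T.
    """
    return [[kernel(a, b) for b in B] for a in A]


def sparsify_by_magnitude(delta, sparsity):
    """Zero out the smallest-magnitude entries so no index list has to be stored."""
    flat = sorted(abs(v) for row in delta for v in row)
    n_drop = int(len(flat) * sparsity)
    if n_drop == 0:
        return [list(row) for row in delta]
    threshold = flat[n_drop - 1]
    # Entries tied with the threshold are dropped as well.
    return [[v if abs(v) > threshold else 0.0 for v in row] for row in delta]


A = [[1, 0], [0, 1], [1, 1]]  # 3 output units, rank 2
B = [[1, 2], [3, 4]]          # 2 input units, rank 2
delta = kernel_merge(A, B)
sparse_delta = sparsify_by_magnitude(delta, sparsity=0.5)
```

Only the two small factors would be trainable in such a scheme; the sparse pattern is recomputed from the merged values rather than stored as indexes.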

Towards Dynamic Message Passing on Graphs

Oct 31, 2024

Junshu Sun, Chenxue Yang, Xiangyang Ji, Qingming Huang, Shuhui Wang

Abstract:Message passing plays a vital role in graph neural networks (GNNs) for effective feature learning. However, over-reliance on the input topology diminishes the efficacy of message passing and restricts the ability of GNNs. Despite efforts to mitigate this reliance, existing studies encounter message-passing bottlenecks or high computational expense, which calls for flexible message passing with low complexity. In this paper, we propose a novel dynamic message-passing mechanism for GNNs. It projects graph nodes and learnable pseudo nodes into a common space with measurable spatial relations between them. With nodes moving in the space, their evolving relations facilitate flexible pathway construction for a dynamic message-passing process. By associating pseudo nodes with input graphs through their measured relations, graph nodes can communicate with each other indirectly through pseudo nodes under linear complexity. We further develop a GNN model named $\mathtt{\mathbf{N^2}}$ based on our dynamic message-passing mechanism. $\mathtt{\mathbf{N^2}}$ employs a single recurrent layer to recursively generate the displacements of nodes and construct optimal dynamic pathways. Evaluation on eighteen benchmarks demonstrates the superior performance of $\mathtt{\mathbf{N^2}}$ over popular GNNs. $\mathtt{\mathbf{N^2}}$ successfully scales to large-scale benchmarks and requires significantly fewer parameters for graph classification with the shared recurrent layer.

* Accepted by NeurIPS 2024
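
The abstract above describes graph nodes exchanging information through a small set of pseudo nodes placed in a common space. A minimal sketch of one such exchange follows; the softmax over negative squared distances as the "measured relation" and the function names are assumptions, and the node displacement step of the actual model is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pseudo_node_pass(node_pos, node_feat, pseudo_pos):
    """One round of node -> pseudo node -> node message passing.

    Cost is O(N * M) for N graph nodes and M pseudo nodes, i.e. linear in N
    when M is a fixed constant.
    """
    # weights[i][k]: how strongly graph node i relates to pseudo node k.
    weights = [softmax([-sq_dist(p, q) for q in pseudo_pos]) for p in node_pos]
    dim = len(node_feat[0])
    n_pseudo = len(pseudo_pos)
    # Step 1: each pseudo node aggregates a weighted mean of node features.
    pseudo_feat = []
    for k in range(n_pseudo):
        mass = sum(w[k] for w in weights)
        pseudo_feat.append(
            [sum(w[k] * f[d] for w, f in zip(weights, node_feat)) / mass
             for d in range(dim)]
        )
    # Step 2: each graph node reads back from the pseudo nodes.
    return [
        [sum(w[k] * pseudo_feat[k][d] for k in range(n_pseudo)) for d in range(dim)]
        for w in weights
    ]


# Two graph nodes and a single pseudo node: both nodes receive the feature mean.
out = pseudo_node_pass([[0.0, 0.0], [1.0, 0.0]], [[1.0], [3.0]], [[0.5, 0.0]])
```

Because no node ever attends to all other nodes directly, the pairwise N-by-N interaction is avoided, which is the source of the linear complexity mentioned in the abstract.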

Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering

Oct 12, 2024

Ting Yu, Kunhao Fu, Shuhui Wang, Qingming Huang, Jun Yu

Abstract:Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite advancements in multi-modal pre-trained models and video-language foundation models, these systems often struggle with domain-specific VideoQA due to their generalized pre-training objectives. Addressing this gap necessitates bridging the divide between broad cross-modal knowledge and the specific inference demands of VideoQA tasks. To this end, we introduce HeurVidQA, a framework that leverages domain-specific entity-action heuristics to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning. By delivering fine-grained heuristics, we improve the model's ability to identify and interpret key entities and actions, thereby enhancing its reasoning capabilities. Extensive evaluations across multiple VideoQA datasets demonstrate that our method significantly outperforms existing models, underscoring the importance of integrating domain-specific knowledge into video-language models for more accurate and context-aware VideoQA.

* IEEE Transactions on Circuits and Systems for Video Technology, 2024

Scalable Graph Compressed Convolutions

Jul 26, 2024

Junshu Sun, Chenxue Yang, Shuhui Wang, Qingming Huang

Abstract:Designing effective graph neural networks (GNNs) with message passing poses two fundamental challenges, i.e., determining optimal message-passing pathways and designing local aggregators. Previous methods of designing optimal pathways suffer from information loss on the input features. On the other hand, existing local aggregators generally fail to extract multi-scale features and approximate diverse operators under limited parameter scales. In contrast to these methods, Euclidean convolution has proven to be an expressive aggregator, making it a strong candidate for GNN construction. However, generalizing Euclidean convolution to graphs is challenging because of the irregular structure of graphs. To bridge the gap between Euclidean space and graph topology, we propose a differentiable method that applies permutations to calibrate input graphs for Euclidean convolution. The permutations constrain all nodes in a row regardless of their input order and therefore enable the flexible generalization of Euclidean convolution to graphs. Based on the graph calibration, we propose the Compressed Convolution Network (CoCN) for hierarchical graph representation learning. CoCN follows the local feature-learning and global parameter-sharing mechanisms of convolutional neural networks. The whole model can be trained end-to-end, with compressed convolution applied to learn individual node features and their corresponding structure features. CoCN can further borrow successful practices from Euclidean convolution, including residual connections and the inception mechanism. We validate CoCN on both node-level and graph-level benchmarks. CoCN achieves superior performance over competitive GNN baselines. Code is available at https://github.com/sunjss/CoCN.

Uncertainty-boosted Robust Video Activity Anticipation

Apr 29, 2024

Zhaobo Qi, Shuhui Wang, Weigang Zhang, Qingming Huang

Abstract:Video activity anticipation aims to predict what will happen in the future, with broad application prospects ranging from robot vision to autonomous driving. Despite recent progress, the data uncertainty issue, reflected in the content evolution process and the dynamic correlation in event labels, has been somewhat overlooked. This reduces the model's generalization ability and its deep understanding of video content, leading to serious error accumulation and degraded performance. In this paper, we address the uncertainty learning problem and propose an uncertainty-boosted robust video activity anticipation framework, which generates uncertainty values to indicate the credibility of the anticipation results. The uncertainty value is used to derive a temperature parameter in the softmax function to modulate the predicted target activity distribution. To guarantee the distribution adjustment, we construct a reasonable target activity label representation by incorporating the activity evolution from the temporal class correlation and the semantic relationship. Moreover, we quantify the uncertainty into relative values by comparing the uncertainty among sample pairs and their temporal lengths. This relative strategy provides a more accessible way to model uncertainty than quantifying the absolute uncertainty values on the whole dataset. Experiments on multiple backbones and benchmarks show that our framework achieves promising performance and better robustness and interpretability. Source code is available at https://github.com/qzhb/UbRV2A.

* Accepted by T-PAMI
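
The mechanism in the abstract above, where an uncertainty value sets the softmax temperature, can be sketched in a few lines. The mapping `temperature = 1 + uncertainty` is an assumption made purely for illustration; the abstract does not say how the temperature is derived.

```python
import math


def tempered_softmax(logits, uncertainty):
    """Softmax whose temperature grows with the predicted uncertainty.

    Hypothetical mapping: temperature = 1 + max(0, uncertainty), so a sample
    with zero uncertainty gets the ordinary softmax and an uncertain sample
    gets a flatter (less confident) activity distribution.
    """
    temperature = 1.0 + max(0.0, uncertainty)
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
confident = tempered_softmax(logits, uncertainty=0.0)
unsure = tempered_softmax(logits, uncertainty=3.0)
```

The ranking of activities is unchanged by the temperature; only the sharpness of the distribution differs, so the top probability in `unsure` is lower than in `confident`.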

Confusing Pair Correction Based on Category Prototype for Domain Adaptation under Noisy Environments

Mar 19, 2024

Churan Zhi, Junbao Zhuo, Shuhui Wang

Abstract:In this paper, we address unsupervised domain adaptation under noisy environments, which is more challenging and practical than traditional domain adaptation. In this scenario, the model is prone to overfitting noisy labels, resulting in a more pronounced domain shift and a notable decline in overall model performance. Previous methods employed prototype methods for domain adaptation on robust feature spaces. However, these approaches struggle to effectively classify classes with similar features under noisy environments. To address this issue, we propose a new method to detect and correct confusing class pairs. We first divide classes into easy and hard classes based on the small-loss criterion. We then leverage the top-2 predictions for each sample after aligning the source and target domains to find the confusing pair in the hard classes. We apply label correction to the noisy samples within the confusing pair. With the proposed label correction method, we can train our model with more accurate labels. Extensive experiments confirm the effectiveness of our method and demonstrate its favorable performance compared with existing state-of-the-art methods. Our code is publicly available at https://github.com/Hehxcf/CPC/.

* AAAI 2024
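
The two detection steps in the abstract above (an easy/hard class split by the small-loss criterion, then a search for the confusing pair using top-2 predictions) can be sketched as below. The fixed loss `threshold`, the "most frequent pair" rule, and the function names are assumptions for illustration and are not taken from the released code.

```python
from collections import Counter


def split_easy_hard(mean_loss_per_class, threshold):
    """Small-loss criterion: classes with a low mean loss are treated as easy."""
    easy = {c for c, loss in mean_loss_per_class.items() if loss <= threshold}
    hard = set(mean_loss_per_class) - easy
    return easy, hard


def find_confusing_pair(top2_predictions, hard_classes):
    """Return the pair of hard classes that most often co-occur as top-2 predictions.

    top2_predictions holds one (first_choice, second_choice) tuple per sample.
    Returns None when no sample has both top-2 classes among the hard classes.
    """
    counts = Counter(
        tuple(sorted(pair))
        for pair in top2_predictions
        if pair[0] in hard_classes and pair[1] in hard_classes
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up per-class mean losses and per-sample top-2 predictions.
easy, hard = split_easy_hard({0: 0.1, 1: 0.9, 2: 0.8, 3: 0.2}, threshold=0.5)
pair = find_confusing_pair([(1, 2), (2, 1), (0, 3), (1, 2)], hard)
```

Samples whose labels fall inside the detected pair would then be the candidates for label correction.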

A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes

Mar 12, 2024

Ting Yu, Xiaojun Lin, Shuhui Wang, Weiguo Sheng, Qingming Huang, Jun Yu

Abstract:Three-Dimensional (3D) dense captioning is an emerging vision-language bridging task that aims to generate multiple detailed and accurate descriptions for 3D scenes. It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning, as well as complexities in data collection and processing of 3D point cloud sources. Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field, which hinders its progress. In this paper, we provide a comprehensive review of 3D dense captioning, covering task definition, architecture classification, dataset analysis, evaluation metrics, and in-depth discussions of future prospects. Based on a synthesis of previous literature, we refine a standard pipeline that serves as a common paradigm for existing methods. We also introduce a clear taxonomy of existing models, summarize the technologies involved in different modules, and conduct detailed experimental analysis. Instead of introducing methods in chronological order, we categorize them into different classes to facilitate exploration and analysis of the differences and connections among existing techniques. We also provide a reading guideline to help readers with different backgrounds and purposes read efficiently. Furthermore, we propose a series of promising future directions for 3D dense captioning by identifying challenges and aligning them with the development of related tasks, offering valuable insights and inspiring future research in this field. Our aim is to provide a comprehensive understanding of 3D dense captioning, foster further investigations, and contribute to the development of novel applications in multimedia and related domains.

Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video

Jan 19, 2024

Zhaobo Qi, Yibo Yuan, Xiaowen Ruan, Shuhui Wang, Weigang Zhang, Qingming Huang

Abstract:Temporal Sentence Grounding in Video (TSGV) is troubled by the dataset bias issue, which is caused by the uneven temporal distribution of the target moments for samples with similar semantic components in input videos or query texts. Existing methods resort to prior knowledge about bias to artificially break this uneven distribution, which only removes a limited amount of significant language biases. In this work, we propose the bias-conflict sample synthesis and adversarial removal debias strategy (BSSARD), which dynamically generates bias-conflict samples by explicitly leveraging potentially spurious correlations between single-modality features and the temporal position of the target moments. Through adversarial training, its bias generators continuously introduce biases and generate bias-conflict samples to deceive its grounding model. Meanwhile, the grounding model continuously eliminates the introduced biases, which requires it to model multi-modality alignment information. BSSARD can cover most kinds of coupling relationships and disrupt language and visual biases simultaneously. Extensive experiments on Charades-CD and ActivityNet-CD demonstrate the promising debiasing capability of BSSARD. Source code is available at https://github.com/qzhb/BSSARD.

* Accepted by AAAI 2024
