Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiangnan Li

CV-DCLR: Causal-Visual Dynamic Label Refinement for Robust Zero-Shot Learning

Jul 01, 2026

Can Wang, Jiangnan Li, Mingyu Li, Yining Song, Kangrui Ren, Min Gan, Jinfu Fan

Abstract:Zero-Shot Learning (ZSL) facilitates knowledge transfer via shared semantic spaces. However, a critical bottleneck in this paradigm is Semantic Entanglement, where visual representations are inevitably conflated with visually similar semantic concepts, such as distinguishing the intrinsic traits of a Wolf from the shared features of a Husky. Existing global alignment methods often indiscriminately maximize correlations between visual and semantic modalities, leading models to overfit spurious similarities rather than capturing distinctive class identities. To address this fundamental limitation, we propose the Causal-Visual Dynamic Label Refinement (CV-DCLR) framework. Unlike traditional approaches that rely on superficial visual statistics, CV-DCLR recalibrates visual-semantic associations via a Dual-Stream Mutual Correction Mechanism. This includes a Visual Likelihood Stream to model observational patterns and a Causal Importance Stream that verifies the structural necessity of candidate prototypes through Counterfactual Intervention. Acting as a logical filter, our adaptive gating mechanism dynamically modulates feature responses to amplify genuine causal traits while suppressing visually plausible but structurally irrelevant distractors. Extensive experiments on the CUB, SUN, and AWA2 benchmarks under a rigorous Semantic Entanglement Injection protocol demonstrate that CV-DCLR significantly outperforms state-of-the-art methods in high-ambiguity scenarios. Specifically, while existing models suffer catastrophic degradation under entanglement, our framework maintains robust performance, effectively disentangling true class identities from semantic confounders.

Via

Access Paper or Ask Questions

MiA-Signature: Approximating Global Activation for Long-Context Understanding

May 07, 2026

Yuqing Li, Jiangnan Li, Mo Yu, Zheng Lin, Weiping Wang, Jie Zhou

Abstract:A growing body of work in cognitive science suggests that reportable conscious access is associated with \emph{global ignition} over distributed memory systems, while such activation is only partially accessible as individuals cannot directly access or enumerate all activated contents. This tension suggests a plausible mechanism that cognition may rely on a compact representation that approximates the global influence of activation on downstream processing. Inspired by this idea, we introduce the concept of \textbf{Mindscape Activation Signature (MiA-Signature)}, a compressed representation of the global activation pattern induced by a query. In LLM systems, this is instantiated via submodular-based selection of high-level concepts that cover the activated context space, optionally refined through lightweight iterative updates using working memory. The resulting MiA-Signature serves as a conditioning signal that approximates the effect of the full activation state while remaining computationally tractable. Integrating MiA-Signatures into both RAG and agentic systems yields consistent performance gains across multiple long-context understanding tasks.

* This is a work in progress; we will continue to revise and improve the manuscript

Via

Access Paper or Ask Questions

Dynamic Visual-semantic Alignment for Zero-shot Learning with Ambiguous Labels

Apr 20, 2026

Jiangnan Li, Linqing Huang, Xiaowen Yan, Min Gan, Wenpeng Lu, Jinfu Fan

Abstract:Zero-shot learning (ZSL) aims to recognize unseen classes without visual instances. However, existing methods usually assume clean labels, overlooking real-world label noise and ambiguity, which degrades performance. To bridge this gap, we propose the Dynamic Visual-semantic Alignment (DVSA), a robust ZSL framework for learning from ambiguous labels. DVSA uses a bidirectional visual-semantic alignment module with attention to mutually calibrate visual features and attribute prototypes, and a contrastive optimization grounded in Mutual Information (MI) at the attribute level to strengthen discriminative, semantically consistent attributes. In addition, a dynamic label disambiguation mechanism iteratively corrects noisy supervision while preserving semantic consistency, narrowing the instance-label gap, and improving generalization. Extensive experiments on standard benchmarks verify that DVSA achieves stronger performance under ambiguous supervision.

* Accepted by ICME 2026 (IEEE International Conference on Multimedia and Expo)

Via

Access Paper or Ask Questions

CLIP-driven Zero-shot Learning with Ambiguous Labels

Mar 05, 2026

Jinfu Fan, Jiangnan Li, Xiaowen Yan, Xiaohui Zhong, Wenpeng Lu, Linqing Huang

Abstract:Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noise and ambiguous labels can significantly reduce the performance of ZSL. To address this, we propose a new CLIP-driven partial label zero-shot learning (CLIP-PZSL) framework to handle label ambiguity. First, we use CLIP to extract instance and label features. Then, a semantic mining block fuses these features to extract discriminative label embeddings. We also introduce a partial zero-shot loss, which assigns weights to candidate labels based on their relevance to the instance and aligns instance and label embeddings to minimize semantic mismatch. As the training goes on, the ground-truth labels are progressively identified, and the refined labels and label embeddings in turn help improve the semantic alignment of instance and label features. Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL.

* Accepted by ICASSP 2026 (IEEE International Conference on Acoustics, Speech, and Signal Processing)

Via

Access Paper or Ask Questions

Query-focused and Memory-aware Reranker for Long Context Processing

Feb 12, 2026

Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang, Jie Zhou

Abstract:Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.

* 14 pages, 2 figures

Via

Access Paper or Ask Questions

ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection

Jan 22, 2026

Guoxuan Ding, Yuqing Li, Ziyan Zhou, Zheng Lin, Daren Zha, Jiangnan Li

Abstract:The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.

* 11 pages, 3 figures, 7 tables

Via

Access Paper or Ask Questions

Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

Dec 19, 2025

Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu

Abstract:Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.

Via

Access Paper or Ask Questions

Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

Jun 10, 2025

Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou

Figure 1 for Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

Figure 2 for Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

Figure 3 for Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

Figure 4 for Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

Abstract:This work focuses on an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within the semantics, resulting in failed dense retrieval on even simple cases. To examine such behaviors, we first introduce a new evaluation dataset in Chinese, named CapRetrieval, whose passages are image captions, and queries are phrases inquiring entities or events in various forms. Zero-shot evaluation suggests that encoders may fail on these fine-grained matching, regardless of training sources or model sizes. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, which obtains the best performance on CapRetrieval. Within this process, we further identify an issue of granularity dilemma, a challenge for embeddings to express fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.

Via

Access Paper or Ask Questions

CONGRAD:Conflicting Gradient Filtering for Multilingual Preference Alignment

Mar 31, 2025

Jiangnan Li, Thuy-Trang Vu, Christian Herold, Amirhossein Tebbifakhr, Shahram Khadivi, Gholamreza Haffari

Abstract:Naive joint training of large language models (LLMs) for multilingual preference alignment can suffer from negative interference. This is a known issue in multilingual training, where conflicting objectives degrade overall performance. However, the impact of this phenomenon in the context of multilingual preference alignment remains largely underexplored. To address this issue, we propose CONGRAD, a scalable and effective filtering method that selects high-quality preference samples with minimal gradient conflicts across languages. Our method leverages gradient surgery to retain samples aligned with an aggregated multilingual update direction. Additionally, we incorporate a sublinear gradient compression strategy that reduces memory overhead during gradient accumulation. We integrate CONGRAD into self-rewarding framework and evaluate on LLaMA3-8B and Gemma2-2B across 10 languages. Results show that CONGRAD consistently outperforms strong baselines in both seen and unseen languages, with minimal alignment tax.

Via

Access Paper or Ask Questions

The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Feb 13, 2025

Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou

Figure 1 for The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Figure 2 for The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Figure 3 for The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Figure 4 for The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Abstract:In a systematic way, we investigate a widely asked question: Do LLMs really understand what they say?, which relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the usage of grid-format inputs that abstractly describe physical phenomena. The grids represents varying levels of understanding, from the core phenomenon, application examples to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on same formatted data added little to their performance.

* NAACL 2025 Main Conference. First 5 authors contributed equally. Project page: https://physico-benchmark.github.io/

Via

Access Paper or Ask Questions