Yuxi Li

Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

May 24, 2025

Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin

Abstract: Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (e.g., with +9.4% on SPair-71k and +4.4% on AP-10K-C.S.).
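
As a rough illustration of the "channel discard" idea mentioned in this abstract, the sketch below flags channels whose average magnitude across all patch tokens dwarfs the typical channel, then zeroes them. This is not the DiTF method: it does not implement the AdaLN-based localization or modulation, and the function names, the median baseline, and the `ratio=10.0` threshold are all assumptions made for the example.

```python
from statistics import median


def find_massive_channels(features, ratio=10.0):
    """Return indices of channels whose mean |activation| over all tokens
    exceeds `ratio` times the median channel's mean |activation|.

    `features` is a list of tokens, each a list of per-channel floats.
    The median baseline and the ratio are illustrative assumptions.
    """
    num_channels = len(features[0])
    mean_abs = [
        sum(abs(token[c]) for token in features) / len(features)
        for c in range(num_channels)
    ]
    baseline = median(mean_abs)
    return [c for c, m in enumerate(mean_abs) if m > ratio * baseline]


def discard_channels(features, channels):
    """Return a copy of `features` with the given channels set to zero."""
    drop = set(channels)
    return [
        [0.0 if c in drop else v for c, v in enumerate(token)]
        for token in features
    ]


# Toy example: 3 patch tokens, 4 channels; channel 2 is large on every token.
tokens = [
    [0.5, -0.2, 80.0, 0.1],
    [-0.3, 0.4, 75.0, -0.6],
    [0.2, -0.1, 90.0, 0.3],
]
massive = find_massive_channels(tokens)
cleaned = discard_channels(tokens, massive)
```

In the toy input only channel 2 is flagged, because it is large on every token while the other channels stay near zero, mirroring the abstract's observation that massive activations sit in a few fixed dimensions across all patch tokens.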

* Under Review

XeMap: Contextual Referring in Large-Scale Remote Sensing Environments

Apr 30, 2025

Yuxi Li, Lu Si, Yujie Hou, Chengaung Liu, Bin Li, Hongjian Fang, Jun Zhang

Abstract: Advancements in remote sensing (RS) imagery have provided high-resolution detail and vast coverage, yet existing methods, such as image-level captioning/retrieval and object-level detection/segmentation, often fail to capture mid-scale semantic entities essential for interpreting large-scale scenes. To address this, we propose the conteXtual referring Map (XeMap) task, which focuses on contextual, fine-grained localization of text-referred regions in large-scale RS scenes. Unlike traditional approaches, XeMap enables precise mapping of mid-scale semantic entities that are often overlooked in image-level or object-level methods. To achieve this, we introduce XeMap-Network, a novel architecture designed to handle the complexities of pixel-level cross-modal contextual referring mapping in RS. The network includes a fusion layer that applies self- and cross-attention mechanisms to enhance the interaction between text and image embeddings. Furthermore, we propose a Hierarchical Multi-Scale Semantic Alignment (HMSA) module that aligns multiscale visual features with the text semantic vector, enabling precise multimodal matching across large-scale RS imagery. To support the XeMap task, we provide a novel, annotated dataset, XeMap-set, specifically tailored for this task, overcoming the lack of XeMap datasets in RS imagery. XeMap-Network is evaluated in a zero-shot setting against state-of-the-art methods, demonstrating superior performance. This highlights its effectiveness in accurately mapping referring regions and providing valuable insights for interpreting large-scale RS environments.

* 14 pages, 8 figures

Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Dec 11, 2024

Yuxi Li, Zhibo Zhang, Kailong Wang, Ling Shi, Haoyu Wang

Abstract: Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions but remain susceptible to critical vulnerabilities, particularly jailbreak attacks. Current jailbreak techniques, while effective, often depend on input modifications, making them detectable and limiting their stealth and scalability. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters by minimally altering internal model structures while preserving the model's intended functionalities. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions without input modifications. By analyzing distinct activation patterns between safe and unsafe queries, TME isolates and approximates SCTs through an optimization process. Implemented in the D-LLM framework, our method achieves an average Attack Success Rate (ASR) of 84.86% on four mainstream open-source LLMs, maintaining high performance. Unlike existing methods, D-LLM eliminates the need for specific triggers or harmful response collections, offering a stealthier and more effective jailbreak strategy. This work reveals a covert and robust threat vector in LLM security and emphasizes the need for stronger safeguards in model safety alignment.

GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models

Aug 09, 2024

Zhibo Zhang, Wuxia Bai, Yuxi Li, Mark Huasong Meng, Kailong Wang, Ling Shi, Li Li, Jun Wang, Haoyu Wang

Abstract: Large language models (LLMs) have achieved unprecedented success in the field of natural language processing. However, the black-box nature of their internal mechanisms has brought many concerns about their trustworthiness and interpretability. Recent research has discovered a class of abnormal tokens in the model's vocabulary space and named them "glitch tokens". Those tokens, once included in the input, may induce the model to produce incorrect, irrelevant, or even harmful results, drastically undermining the reliability and practicality of LLMs. In this work, we aim to enhance the understanding of glitch tokens and propose techniques for their detection and mitigation. We first reveal the characteristic features induced by glitch tokens on LLMs, which are evidenced by significant deviations in the distributions of attention patterns and dynamic information from intermediate model layers. Based on these insights, we develop GlitchProber, a tool for efficient glitch token detection and mitigation. GlitchProber utilizes small-scale sampling, principal component analysis for accelerated feature extraction, and a simple classifier for efficient vocabulary screening. Taking one step further, GlitchProber rectifies abnormal model intermediate layer values to mitigate the destructive effects of glitch tokens. Evaluated on five mainstream open-source LLMs, GlitchProber demonstrates higher efficiency, precision, and recall compared to existing approaches, with an average F1 score of 0.86 and an average repair rate of 50.06%. GlitchProber unveils a novel path to address the challenges posed by glitch tokens and inspires future research toward more robust and interpretable LLMs.

DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction

Jul 25, 2024

Chaofan Gan, Yuanpeng Tu, Yuxi Li, Weiyao Lin

Abstract: With the recent burst of 2D and 3D data, cross-modal retrieval has attracted increasing attention. However, manual labeling by non-experts will inevitably introduce corrupted annotations given ambiguous 2D/3D content. Though previous works have addressed this issue by designing a naive division strategy with hand-crafted thresholds, their performance generally exhibits great sensitivity to the threshold value. Besides, they fail to fully utilize the valuable supervisory signals within each divided subset. To tackle this problem, we propose a Divide-and-conquer 2D-3D cross-modal Alignment and Correction framework (DAC), which comprises Multimodal Dynamic Division (MDD) and Adaptive Alignment and Correction (AAC). Specifically, the former performs accurate sample division by adaptive credibility modeling for each sample based on the compensation information within multimodal loss distribution. Then in AAC, samples in distinct subsets are exploited with different alignment strategies to fully enhance the semantic compactness and meanwhile alleviate over-fitting to noisy labels, where a self-correction strategy is introduced to improve the quality of representation. Moreover, to evaluate the effectiveness in real-world scenarios, we introduce a challenging noisy benchmark, namely Objaverse-N200, which comprises 200k-level samples annotated with 1156 realistic noisy labels. Extensive experiments on both traditional and the newly proposed benchmarks demonstrate the generality and superiority of our DAC, where DAC outperforms state-of-the-art models by a large margin (i.e., with a +5.9% gain on ModelNet40 and +5.8% on Objaverse-N200).

* accepted by ACM MM 2024

AdapNet: Adaptive Noise-Based Network for Low-Quality Image Retrieval

May 28, 2024

Sihe Zhang, Qingdong He, Jinlong Peng, Yuxi Li, Zhengkai Jiang, Jiafu Wu, Mingmin Chi, Yabiao Wang, Chengjie Wang

Abstract: Image retrieval aims to identify visually similar images within a database using a given query image. Traditional methods typically employ both global and local features extracted from images for matching, and may also apply re-ranking techniques to enhance accuracy. However, these methods often fail to account for the noise present in query images, which can stem from natural or human-induced factors, thereby negatively impacting retrieval performance. To mitigate this issue, we introduce a novel setting for low-quality image retrieval, and propose an Adaptive Noise-Based Network (AdapNet) to learn robust abstract representations. Specifically, we devise a quality compensation block trained to compensate for various low-quality factors in input images. Besides, we introduce an innovative adaptive noise-based loss function, which dynamically adjusts its focus on the gradient in accordance with image quality, thereby augmenting the learning of unknown noisy samples during training and enhancing intra-class compactness. To assess the performance, we construct two datasets with low-quality queries, which are built by applying various types of noise to clean query images from the standard Revisited Oxford and Revisited Paris datasets. Comprehensive experimental results illustrate that AdapNet surpasses state-of-the-art methods on the Noise Revisited Oxford and Noise Revisited Paris benchmarks, while maintaining competitive performance on high-quality datasets. The code and constructed datasets will be made available.

Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation

May 20, 2024

Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang

Abstract: Large language models (LLMs) have transformed the field of natural language processing, but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate unintended and potentially harmful content. Existing token-level jailbreaking techniques, while effective, face scalability and efficiency challenges, especially as models undergo frequent updates and incorporate advanced defensive measures. In this paper, we introduce JailMine, an innovative token-level manipulation approach that addresses these limitations effectively. JailMine employs an automated "mining" process to elicit malicious responses from LLMs by strategically selecting affirmative outputs and iteratively reducing the likelihood of rejection. Through rigorous testing across multiple well-known LLMs and datasets, we demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed while maintaining high success rates averaging 95%, even in the face of evolving defensive strategies. Our work contributes to the ongoing effort to assess and mitigate the vulnerability of LLMs to jailbreaking attacks, underscoring the importance of continued vigilance and proactive measures to enhance the security and reliability of these powerful language models.

Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection

Apr 19, 2024

Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, Haoyu Wang

Abstract: With the expanding application of Large Language Models (LLMs) in various domains, it becomes imperative to comprehensively investigate their unforeseen behaviors and consequent outcomes. In this study, we introduce and systematically explore the phenomenon of "glitch tokens", which are anomalous tokens produced by established tokenizers and could potentially compromise the models' quality of response. Specifically, we experiment on seven popular LLMs utilizing three distinct tokenizers and involving a total of 182,517 tokens. We present categorizations of the identified glitch tokens and symptoms exhibited by LLMs when interacting with glitch tokens. Based on our observation that glitch tokens tend to cluster in the embedding space, we propose GlitchHunter, a novel iterative clustering-based technique, for efficient glitch token detection. The evaluation shows that our approach notably outperforms three baseline methods on eight open-source LLMs. To the best of our knowledge, we present the first comprehensive study on glitch tokens. Our new detection method further provides valuable insights into mitigating tokenization-related errors in LLMs.

Memory Consistency Guided Divide-and-Conquer Learning for Generalized Category Discovery

Feb 01, 2024

Yuanpeng Tu, Zhun Zhong, Yuxi Li, Hengshuang Zhao

Abstract: Generalized category discovery (GCD) aims at addressing a more realistic and challenging setting of semi-supervised learning, where only part of the category labels are assigned to certain training samples. Previous methods generally employ naive contrastive learning or unsupervised clustering schemes for all the samples. Nevertheless, they usually ignore the inherent critical information within the historical predictions of the model being trained. Specifically, we empirically reveal that a significant number of salient unlabeled samples yield consistent historical predictions corresponding to their ground truth category. From this observation, we propose a Memory Consistency guided Divide-and-conquer Learning framework (MCDL). In this framework, we introduce two memory banks to record historical predictions of unlabeled data, which are exploited to measure the credibility of each sample in terms of its prediction consistency. With the guidance of credibility, we can design a divide-and-conquer learning strategy to fully utilize the discriminative information of unlabeled data while alleviating the negative influence of noisy labels. Extensive experimental results on multiple benchmarks demonstrate the generality and superiority of our method, where our method outperforms state-of-the-art models by a large margin on both seen and unseen classes of the generic image recognition and challenging semantic shift settings (i.e., with a +8.4% gain on CUB and +8.1% on Stanford Cars).
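
The credibility-by-consistency idea in this abstract can be illustrated with a small sketch. It is not the MCDL implementation: it uses a single prediction history per sample rather than the two memory banks the abstract describes, and the names `credibility` and `divide`, the agreement-fraction score, and the `threshold=0.8` cutoff are assumptions made for the example.

```python
from collections import Counter


def credibility(history):
    """Fraction of past predictions that agree with the most frequent one.

    `history` is the list of class ids predicted for one unlabeled sample
    over previous epochs. An empty history scores 0.0.
    """
    if not history:
        return 0.0
    _, count = Counter(history).most_common(1)[0]
    return count / len(history)


def divide(memory, threshold=0.8):
    """Split sample ids into (credible, uncertain) by prediction consistency.

    The threshold is a hypothetical value chosen for illustration.
    """
    credible, uncertain = [], []
    for sample_id, history in memory.items():
        if credibility(history) >= threshold:
            credible.append(sample_id)
        else:
            uncertain.append(sample_id)
    return credible, uncertain


# Toy memory bank: predicted class ids over five epochs per sample.
memory = {
    "img_a": [3, 3, 3, 3, 3],  # fully consistent
    "img_b": [1, 4, 1, 2, 4],  # unstable
    "img_c": [7, 7, 7, 2, 7],  # mostly consistent
}
credible, uncertain = divide(memory)
```

In a divide-and-conquer setup, the two groups would then be handled by different learning strategies; how each subset is actually used is specific to the paper and is not shown here.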

Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection

Jan 17, 2024

Yuanpeng Tu, Boshen Zhang, Liang Liu, Yuxi Li, Xuhai Chen, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Cai Rong Zhao

Abstract: Industrial anomaly detection is generally addressed as an unsupervised task that aims at locating defects with only normal training samples. Recently, numerous 2D anomaly detection methods have been proposed and have achieved promising results; however, using only the 2D RGB data as input is not sufficient to identify imperceptible geometric surface anomalies. Hence, in this work, we focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets, i.e., ImageNet, to construct feature databases. We empirically find that directly using these pre-trained models is not optimal: it can either fail to detect subtle defects or mistake abnormal features as normal ones. This may be attributed to the domain gap between target industrial data and source data. Towards this problem, we propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection. Both intra-modal adaptation and cross-modal alignment are optimized from a local-to-global perspective in LSFA to ensure the representation quality and consistency in the inference stage. Extensive experiments demonstrate that our method not only brings a significant performance boost to feature embedding based approaches, but also outperforms previous State-of-The-Art (SoTA) methods prominently on both MVTec-3D AD and Eyecandies datasets, e.g., LSFA achieves 97.1% I-AUROC on MVTec-3D, surpassing the previous SoTA by +3.4%.
