Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jigang Wang

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Jun 11, 2025

Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song(+1 more)

Figure 1 for AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Figure 2 for AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Figure 3 for AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Figure 4 for AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Abstract:Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or harmless data can compromise safeguards. In this paper, building on the concept of alignment direction -- defined by the weight difference between aligned and unaligned models -- we observe that perturbations along this direction preserve model safety. In contrast, perturbations along directions orthogonal to this alignment are strongly linked to harmful direction perturbations, rapidly degrading safety and framing the parameter space as a narrow safety basin. Based on this insight, we propose a methodology for safety fine-tuning called AsFT (Anchoring Safety in Fine-Tuning), which integrates a regularization term into the training objective. This term uses the alignment direction as an anchor to suppress updates in harmful directions, ensuring that fine-tuning is constrained within the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60 percent, improving model performance by 3.44 percent, and maintaining robust performance across various experimental settings. Code is available at https://github.com/PKU-YuanGroup/AsFT

Via

Access Paper or Ask Questions

Research on the application of contrastive learning in multi-label text classification

Dec 01, 2022

Nankai Lin, Guanqiu Qin, Jigang Wang, Aimin Yang, Dong Zhou

Figure 1 for Research on the application of contrastive learning in multi-label text classification

Figure 2 for Research on the application of contrastive learning in multi-label text classification

Figure 3 for Research on the application of contrastive learning in multi-label text classification

Figure 4 for Research on the application of contrastive learning in multi-label text classification

Abstract:The effective application of contrastive learning technology in natural language processing tasks shows the superiority of contrastive learning in text analysis tasks. How to construct positive and negative samples correctly and reasonably is the core challenge of contrastive learning. Since it is difficult to construct contrastive objects in multi-label multi-classification tasks, there are few contrastive losses for multi-label multi-classification text classification. In this paper, we propose five contrastive losses for multi-label multi-classification tasks. They are Strict Contrastive Loss (SCL), Intra-label Contrastive Loss (ICL), Jaccard Similarity Contrastive Loss (JSCL), and Jaccard Similarity Probability Contrastive Loss (JSPCL) and Stepwise Label Contrastive Loss (SLCL). We explore the effectiveness of contrastive learning for multi-label multi-classification tasks under different strategies, and provide a set of baseline methods for contrastive learning techniques on multi-label classification tasks. We also perform an interpretability analysis of our approach to show how different contrastive learning methods play their roles. The experimental results in this paper demonstrate that our proposed contrastive losses can bring some improvement for multi-label multi-classification tasks. Our work reveal how to "appropriately" change the contrastive way of contrastive learning is the key idea to improve the adaptability of contrastive learning in multi-label multi-classification tasks.

Via

Access Paper or Ask Questions