Abstract: As Large Language Models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into Reinforcement Learning from Human Feedback (RLHF). However, these approaches tend to be complex, as they combine the already complicated RLHF procedure with additional steps required by the safety constraints. Inspired by Direct Preference Optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning, without requiring relaxation. SafeDPO introduces only one additional hyperparameter to further enhance safety and requires only minor modifications to standard DPO. As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms, both in terms of aligning with human preferences and improving safety.
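The abstract does not give SafeDPO's exact objective, so the following is only a minimal sketch of how a "minor modification to standard DPO" with one extra safety hyperparameter might look. The function name, the hyperparameter `delta`, and the placement of the safety margin are illustrative assumptions, not the authors' formula; only the base DPO logit is standard.

```python
# Hedged sketch: a DPO-style pairwise loss with one extra safety hyperparameter.
# The exact SafeDPO objective is not stated in the abstract; `delta` and the
# margin placement below are illustrative assumptions, not the authors' method.
import torch
import torch.nn.functional as F


def dpo_style_safety_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x), shape (B,)
    rejected_is_unsafe: torch.Tensor,     # bool mask, shape (B,): rejected response flagged unsafe
    beta: float = 0.1,
    delta: float = 1.0,                   # hypothetical extra hyperparameter to push safety harder
) -> torch.Tensor:
    # Standard DPO logit: implicit reward gap between chosen and rejected responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # Illustrative "minor modification": require an extra margin whenever the
    # rejected response is unsafe, so unsafe outputs are pushed down harder.
    logits = logits - delta * rejected_is_unsafe.float()
    return -F.logsigmoid(logits).mean()
```

Note that, as in standard DPO, this sketch needs only per-response log-probabilities from the policy and a frozen reference model, which is consistent with the claim that no separate reward or cost model fitting and no sampling during fine-tuning are required.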
Abstract: Commonsense reasoning systems should be able to generalize to diverse reasoning cases. However, most state-of-the-art approaches depend on expensive data annotations and overfit to a specific benchmark without learning how to perform general semantic reasoning. To overcome these drawbacks, zero-shot QA systems have shown promise as a robust learning scheme by transforming a commonsense knowledge graph (KG) into synthetic QA-form samples for model training. Considering the increasing variety of commonsense KGs, this paper aims to extend the zero-shot transfer learning scenario to multiple-source settings, where different KGs can be utilized synergistically. Towards this goal, we propose a new zero-shot commonsense reasoning framework based on a modular variant of knowledge aggregation, which mitigates the loss of knowledge caused by interference among the different knowledge sources. Results on five commonsense reasoning benchmarks demonstrate the efficacy of our framework, showing improved performance when multiple KGs are combined.
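As one concrete reading of "a modular variant of knowledge aggregation", the sketch below keeps a separate small module per KG and mixes their outputs with a learned gate, so the knowledge sources do not compete in shared weights. The class names and the adapter-plus-gating design are assumptions for illustration; the abstract does not specify the actual architecture.

```python
# Hedged sketch: one possible "modular" aggregation of per-KG knowledge modules.
# The per-KG adapters and the simple learned gating below are illustrative
# assumptions only, not the framework described in the paper.
import torch
import torch.nn as nn


class PerKGModule(nn.Module):
    """A small bottleneck adapter trained on QA samples synthesized from one KG."""

    def __init__(self, hidden_dim: int, bottleneck: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, hidden_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.net(h)  # residual update of the encoder representation


class ModularKGAggregator(nn.Module):
    """Keeps each KG's knowledge in its own module, then mixes the modules with a
    learned gate, so the KGs are not forced to share (and overwrite) the same weights."""

    def __init__(self, hidden_dim: int, num_kgs: int):
        super().__init__()
        self.kg_modules = nn.ModuleList([PerKGModule(hidden_dim) for _ in range(num_kgs)])
        self.gate = nn.Linear(hidden_dim, num_kgs)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        expert_outs = torch.stack([m(h) for m in self.kg_modules], dim=-2)  # (B, K, H)
        weights = torch.softmax(self.gate(h), dim=-1).unsqueeze(-1)         # (B, K, 1)
        return (weights * expert_outs).sum(dim=-2)                           # (B, H)
```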
Abstract: Active learning can be defined as iterations of data labeling, model training, and data acquisition, until sufficient labels are acquired. A traditional view of data acquisition is that, through iterations, knowledge from human labels and models is implicitly distilled to monotonically increase the accuracy and label consistency. Under this assumption, the most recently trained model is a good surrogate for the current labeled data, from which new data is acquired based on uncertainty/diversity. Our contribution is debunking this myth and proposing a new objective for distillation. First, we identify example forgetting, which indicates the loss of knowledge learned across iterations. Second, for this reason, the last model is no longer the best teacher: to mitigate such forgotten knowledge, we select one of its predecessor models as the teacher, using our proposed notion of "consistency". We show that this novel distillation is distinctive in three aspects: first, consistency prevents forgetting of previously learned labels; second, consistency improves both the uncertainty and diversity of the labeled data; lastly, consistency redeems defective labels produced by human annotators.
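The abstract does not define the "consistency" criterion used to select a predecessor model as the teacher, so the sketch below approximates it as agreement with the human labels on the current labeled pool and then distills from the selected checkpoint with a standard KL loss. The helper names and this approximation are assumptions for illustration, not the paper's exact measure.

```python
# Hedged sketch: choosing a predecessor checkpoint as the distillation teacher.
# "Consistency" is approximated here as agreement with human labels on the
# labeled pool, which is an assumption; the paper's definition may differ.
import torch
import torch.nn.functional as F


@torch.no_grad()
def label_agreement(model, inputs: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of labeled examples this checkpoint still classifies correctly."""
    preds = model(inputs).argmax(dim=-1)
    return (preds == labels).float().mean().item()


def select_teacher(checkpoints, inputs, labels):
    """Pick the predecessor model most consistent with the labels, rather than
    defaulting to the most recently trained model (which may have forgotten examples)."""
    scores = [label_agreement(m, inputs, labels) for m in checkpoints]
    return checkpoints[max(range(len(scores)), key=scores.__getitem__)]


def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Standard temperature-scaled KL distillation from the selected teacher."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In this sketch, the teacher selected by the agreement score supervises the next iteration's model alongside the human labels, which is one way a predecessor checkpoint could counteract example forgetting.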