Title: We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

Sep 26, 2025

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem

Abstract: Alignment of Large Language Models (LLMs) along multiple objectives (helpfulness, harmlessness, and honesty, abbreviated HHH) is critical for safe and reliable deployment. Prior work has used steering vectors, small control signals injected into hidden states, to guide LLM outputs, typically via one-to-one (1-to-1) Transformer decoders. In this setting, optimizing a single alignment objective can inadvertently overwrite representations learned for other objectives, leading to catastrophic forgetting. More recent approaches extend steering vectors via one-to-many (1-to-N) Transformer decoders. While this alleviates catastrophic forgetting, naive multi-branch designs optimize each objective independently, which can cause inference fragmentation: outputs across HHH objectives may become inconsistent. We propose Adaptive Multi-Branch Steering (AMBS), a two-stage 1-to-N framework for unified and efficient multi-objective alignment. In Stage I, post-attention hidden states of the Transformer layer are computed once to form a shared representation. In Stage II, this representation is cloned into parallel branches and steered via a policy-reference mechanism, enabling objective-specific control while maintaining cross-objective consistency. Empirical evaluations on Alpaca, BeaverTails, and TruthfulQA show that AMBS consistently improves HHH alignment across multiple 7B LLM backbones. For example, on DeepSeek-7B, AMBS improves average alignment scores by +32.4% and reduces unsafe outputs by 11.0% compared to a naive 1-to-N baseline, while remaining competitive with state-of-the-art methods.
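The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the placeholder Stage I computation, the additive steering rule, and the `alpha` scale are all assumptions made for illustration, and the policy-reference mechanism is not modeled because the abstract does not specify it. The sketch only shows a shared representation being computed once and then cloned into one steered branch per objective.

```python
from typing import Dict, List

Vector = List[float]


def compute_shared_state(x: Vector) -> Vector:
    """Stage I stand-in (hypothetical): a placeholder for the real
    post-attention computation. Here it simply doubles each value."""
    return [2.0 * value for value in x]


def steer_branches(
    shared: Vector, steering: Dict[str, Vector], alpha: float = 1.0
) -> Dict[str, Vector]:
    """Stage II stand-in (hypothetical): clone the shared state once per
    objective and add a scaled steering vector to each clone.
    Plain additive steering is an assumption, not the paper's
    policy-reference mechanism."""
    return {
        name: [h + alpha * v for h, v in zip(shared, vec)]
        for name, vec in steering.items()
    }


# Stage I runs once; every branch starts from the same shared state.
shared = compute_shared_state([1.0, 2.0])

# One illustrative steering vector per HHH objective (made-up values).
steering = {
    "helpful": [0.5, 0.0],
    "harmless": [0.0, -0.5],
    "honest": [0.25, 0.25],
}

branches = steer_branches(shared, steering, alpha=2.0)
for name, state in branches.items():
    print(name, state)
```

Because each branch is built from a copy of the shared state rather than modifying it in place, steering one objective leaves the representation used by the others untouched, which is the property the 1-to-N design relies on.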
