Juntao Dai

SafeLawBench: Towards Safe Alignment of Large Language Models
Jun 07, 2025

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback
May 29, 2025

The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels
May 26, 2025

Mitigating Deceptive Alignment via Self-Monitoring
May 24, 2025

Measuring Hong Kong Massive Multi-Task Language Understanding
May 04, 2025

ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs
Mar 17, 2025

Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation
Dec 15, 2024

Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback
Aug 30, 2024

Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction
Feb 06, 2024

AI Alignment: A Comprehensive Survey
Nov 01, 2023