Rongwu Xu

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Aug 06, 2025

The Singapore Consensus on Global AI Safety Research Priorities

Jun 25, 2025

Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

May 23, 2025

LIFEBench: Evaluating Length Instruction Following in Large Language Models

May 22, 2025

AI Awareness

Apr 25, 2025

"Nuclear Deployed!": Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents

Feb 17, 2025

Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall

Oct 31, 2024

Sing it, Narrate it: Quality Musical Lyrics Translation

Oct 29, 2024

On the Role of Attention Heads in Large Language Model Safety

Oct 17, 2024

DebateQA: Evaluating Question Answering on Debatable Knowledge

Aug 02, 2024