Abstract:Audio generation and audio-to-text understanding remain largely separate, with diffusion models dominating high-fidelity synthesis and autoregressive (AR) language models driving captioning and semantic prediction. Existing unified approaches typically rely on either heterogeneous modules or AR-centric modeling, which can hinder joint optimization and limit acoustic fidelity. We present UAT, to our knowledge, the first diffusion-centric framework that supports unified audio generation, editing, and captioning. UAT couples continuous latent diffusion for audio with masked discrete diffusion for text, enabling bidirectional audio-text modeling within a shared dual-stream backbone. Experiments show that UAT preserves strong audio generation and editing capabilities while achieving competitive captioning performance, demonstrating a favorable balance between acoustic synthesis and semantic prediction. Demo samples are available at https://UAT-demo.github.io.
Abstract:In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model's previous responses, BAIT turns the model's own reasoning and consistency tendency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly advancing conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge request; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps have a certain chance of eliciting harmful content while triggering little filtering.
Abstract:As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks. In this work, we introduce a proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs. We construct the Butterfly dataset to instantiate this framework in the environmental and ecological domain. It contains 1,094 queries that simulate ordinary solution-seeking activities whose responses may induce latent ecological impact. Through experiments across five widely used LLMs, we analyze the effects of response length, languages, and modality. Experimental results reveal consistent, significant declines in proactive awareness under length-restricted responses, cross-lingual similarities, and persistent blind spots in (multimodal) species protection. These findings highlight a critical gap between current safety alignment and the requirements of real-world ecological responsibility, underscoring the need for proactive safeguards in LLM deployment.
Abstract:Multi-modal intent detection aims to utilize various modalities to understand the user's intentions, which is essential for the deployment of dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse different features of modalities and (2) the limited labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address the above challenges. Firstly, SDIF-DA leverages a shallow-to-deep interaction module to progressively and effectively align and fuse features across text, video, and audio modalities. Secondly, we propose a ChatGPT-based data augmentation approach to automatically augment sufficient training data. Experimental results demonstrate that SDIF-DA can effectively align and fuse multi-modal features by achieving state-of-the-art performance. In addition, extensive analyses show that the introduced data augmentation approach can successfully distill knowledge from the large language model.