Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pei Chen

Controllable Video-to-Music Generation with Multiple Time-Varying Conditions

Jul 28, 2025

Junxian Wu, Weitao You, Heda Zuo, Dengming Zhang, Pei Chen, Lingyun Sun

Abstract:Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that enables learning of V2M fundamentals and audiovisual temporal synchronization while meeting users' needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. For the second stage, we develop a dynamic conditional fusion module and a control-guided decoder module to integrate multiple conditions and accurately guide the music composition process. Extensive experiments demonstrate that our method outperforms existing V2M pipelines in both subjective and objective evaluations, significantly enhancing control and alignment with user expectations.

* Accepted by the 33rd ACM International Conference on Multimedia (ACMMM 2025). The project page is available at https://kita-wjx.github.io/MCV2M/

Via

Access Paper or Ask Questions

UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

Jul 09, 2025

Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen, Zheng Li, Xian Li(+2 more)

Figure 1 for UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

Figure 2 for UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

Figure 3 for UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

Figure 4 for UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

Abstract:The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling the multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation restricts the system from leveraging the intrinsic knowledge of the models simultaneously, which cannot ensure the effectiveness of retrieval benefiting the generation. The existing studies for developing unified models cannot fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.

* Accepted by ACL 2025 (main)

Via

Access Paper or Ask Questions

Ming-Omni: A Unified Multimodal Model for Perception and Generation

Jun 11, 2025

Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang(+48 more)

Figure 1 for Ming-Omni: A Unified Multimodal Model for Perception and Generation

Figure 2 for Ming-Omni: A Unified Multimodal Model for Perception and Generation

Figure 3 for Ming-Omni: A Unified Multimodal Model for Perception and Generation

Figure 4 for Ming-Omni: A Unified Multimodal Model for Perception and Generation

Abstract:We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results showcase Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.

* 18 pages,8 figures

Via

Access Paper or Ask Questions

Aligning Large Language Models with Implicit Preferences from User-Generated Content

Jun 04, 2025

Zhaoxuan Tan, Zheng Li, Tianyi Liu, Haodong Wang, Hyokun Yun, Ming Zeng, Pei Chen, Zhihan Zhang, Yifan Gao, Ruijie Wang(+3 more)

Figure 1 for Aligning Large Language Models with Implicit Preferences from User-Generated Content

Figure 2 for Aligning Large Language Models with Implicit Preferences from User-Generated Content

Figure 3 for Aligning Large Language Models with Implicit Preferences from User-Generated Content

Figure 4 for Aligning Large Language Models with Implicit Preferences from User-Generated Content

Abstract:Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that has the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at https://zhaoxuan.info/PUGC.github.io/

* Accepted to ACL 2025 Main Conference

Via

Access Paper or Ask Questions

END: Early Noise Dropping for Efficient and Effective Context Denoising

Feb 26, 2025

Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu(+2 more)

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (\textsc{END}), a novel approach to mitigate this issue without requiring fine-tuning the LLMs. \textsc{END} segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, \textsc{END} preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that \textsc{END} significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding to the input with the prober, this work also deepens understanding of how LLMs do reasoning with contexts internally.

Via

Access Paper or Ask Questions

Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Feb 10, 2025

Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang(+9 more)

Figure 1 for Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Figure 2 for Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Figure 3 for Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Figure 4 for Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Abstract:Due to the scarcity of agent-oriented pre-training data, LLM-based autonomous agents typically rely on complex prompting or extensive fine-tuning, which often fails to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus-Forge, the first large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus-Forge comprises 103B agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal recipe in data mixing ratios. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre-training corpus in enhancing fundamental agentic capabilities and generalization of LLMs to new tasks or environments.

* Accepted to NAACL 2025 main conference

Via

Access Paper or Ask Questions

GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions

Jan 17, 2025

Heda Zuo, Weitao You, Junxian Wu, Shihong Ren, Pei Chen, Mingxu Zhou, Yujia Lu, Lingyun Sun

Abstract:Composing music for video is essential yet challenging, leading to a growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present General Video-to-Music Generation model (GVMGen), designed for generating high-related music to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions, ensuring the preservation of pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, generative diversity, and application universality.

* Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

Via

Access Paper or Ask Questions

Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning

Dec 26, 2024

Dengming Zhang, Weitao You, Ziheng Liu, Lingyun Sun, Pei Chen

Figure 1 for Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning

Figure 2 for Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning

Figure 3 for Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning

Figure 4 for Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning

Abstract:Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of different moments in music, playing a crucial role in music information retrieval. The existing DMER methods struggle to capture long-term dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though everyone has their own personalized emotional perception in the real world. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short and long-term dependencies using a dual-scale attention transformer, improving the performance in traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotators. Samples in a task are annotated by the same annotator, ensuring consistent perception. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions with just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method can achieve state-of-the-art performance in both traditional DMER and PDMER.

* Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

Via

Access Paper or Ask Questions

GraphicsDreamer: Image to 3D Generation with Physical Consistency

Dec 18, 2024

Pei Chen, Fudong Wang, Yixuan Tong, Jingdong Chen, Ming Yang, Minghui Yang

Figure 1 for GraphicsDreamer: Image to 3D Generation with Physical Consistency

Figure 2 for GraphicsDreamer: Image to 3D Generation with Physical Consistency

Figure 3 for GraphicsDreamer: Image to 3D Generation with Physical Consistency

Figure 4 for GraphicsDreamer: Image to 3D Generation with Physical Consistency

Abstract:Recently, the surge of efficient and automated 3D AI-generated content (AIGC) methods has increasingly illuminated the path of transforming human imagination into complex 3D structures. However, the automated generation of 3D content is still significantly lags in industrial application. This gap exists because 3D modeling demands high-quality assets with sharp geometry, exquisite topology, and physically based rendering (PBR), among other criteria. To narrow the disparity between generated results and artists' expectations, we introduce GraphicsDreamer, a method for creating highly usable 3D meshes from single images. To better capture the geometry and material details, we integrate the PBR lighting equation into our cross-domain diffusion model, concurrently predicting multi-view color, normal, depth images, and PBR materials. In the geometry fusion stage, we continue to enforce the PBR constraints, ensuring that the generated 3D objects possess reliable texture details, supporting realistic relighting. Furthermore, our method incorporates topology optimization and fast UV unwrapping capabilities, allowing the 3D products to be seamlessly imported into graphics engines. Extensive experiments demonstrate that our model can produce high quality 3D assets in a reasonable time cost compared to previous methods.

Via

Access Paper or Ask Questions

Multi-Scale Cross-Fusion and Edge-Supervision Network for Image Splicing Localization

Dec 17, 2024

Yakun Niu, Pei Chen, Lei Zhang, Hongjian Yin, Qi Chang

Figure 1 for Multi-Scale Cross-Fusion and Edge-Supervision Network for Image Splicing Localization

Figure 2 for Multi-Scale Cross-Fusion and Edge-Supervision Network for Image Splicing Localization

Figure 3 for Multi-Scale Cross-Fusion and Edge-Supervision Network for Image Splicing Localization

Figure 4 for Multi-Scale Cross-Fusion and Edge-Supervision Network for Image Splicing Localization

Abstract:Image Splicing Localization (ISL) is a fundamental yet challenging task in digital forensics. Although current approaches have achieved promising performance, the edge information is insufficiently exploited, resulting in poor integrality and high false alarms. To tackle this problem, we propose a multi-scale cross-fusion and edge-supervision network for ISL. Specifically, our framework consists of three key steps: multi-scale features cross-fusion, edge mask prediction and edge-supervision localization. Firstly, we input the RGB image and its noise image into a segmentation network to learn multi-scale features, which are then aggregated via a cross-scale fusion followed by a cross-domain fusion to enhance feature representation. Secondly, we design an edge mask prediction module to effectively mine the reliable boundary artifacts. Finally, the cross-fused features and the reliable edge mask information are seamlessly integrated via an attention mechanism to incrementally supervise and facilitate model training. Extensive experiments on publicly available datasets demonstrate that our proposed method is superior to state-of-the-art schemes.

* 5 pages,3 figures

Via

Access Paper or Ask Questions